DEEPSYNTH
Deep information-synthesis benchmark for agents that must gather, browse, extract, and reason over multiple sources to produce structured answers.
Rows: 12
Primary metric: F1
Sampled: 2026-05-27
Metadata
Metrics: F1, Precision, Recall, Exact Match, LLM Judge
| Rank | Subject | F1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3-deep-research | 8.97 | — | Imported | 2026-05-27 |
| 2 | GPT-5.2-Pro | 8.70 | GPT-5.2 Pro (`openai-gpt-5.2-pro`) | Imported | 2026-05-27 |
| 3 | Smolagent (GPT-5) | 6.42 | — | Imported | 2026-05-27 |
| 4 | Gemini-Pro-2.5 | 6.25 | — | Imported | 2026-05-27 |
| 5 | OWL (GPT-4.1) | 5.41 | — | Imported | 2026-05-27 |
| 6 | GPT-5.1 | 3.83 | GPT-5.1 (`openai-gpt-5.1`) | Imported | 2026-05-27 |
| 7 | Smolagent (GPT-4.1) | 3.75 | — | Imported | 2026-05-27 |
| 8 | GPT-4.1 | 3.46 | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-27 |
| 9 | o3 | 3.29 | o3 (`openai-o3`) | Imported | 2026-05-27 |
| 10 | DeepSeek-R1-Chat | 3.23 | — | Imported | 2026-05-27 |
| 11 | o4-mini | 3.05 | o4 Mini (`openai-o4-mini`) | Imported | 2026-05-27 |
| 12 | DeepSeek-R1-Reasoner | 2.80 | — | Imported | 2026-05-27 |