DEEPSYNTH

Deep information-synthesis benchmark for agents that must gather, browse, extract, and reason over multiple sources to produce structured answers.

Rows: 12
Primary metric: F1
Sampled: 2026-05-27

Metadata

Metrics

F1, Precision, Recall, Exact Match, LLM Judge
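
This page does not state how DEEPSYNTH scores a structured answer, so the following is only a minimal sketch of how precision, recall, F1, and exact match relate to one another. It assumes each answer can be flattened into a set of hashable items, for example (key, value) pairs. The function names `precision_recall_f1` and `exact_match` and the sample data are hypothetical and are not taken from the benchmark's code.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    predicted: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 (assumed scoring scheme, values in 0..1)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty sets so we never divide by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> bool:
    """True only when the predicted items equal the gold items exactly."""
    return set(predicted) == set(gold)


# Made-up illustration: 2 of 3 predicted items are correct, 2 of 4 gold items are found.
predicted = [("a", 1), ("b", 2), ("c", 3)]
gold = [("a", 1), ("b", 2), ("d", 4), ("e", 5)]
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} exact={exact_match(predicted, gold)}")
```

The sketch returns fractions between 0 and 1. The scale of the F1 column in the results table is not stated in this snapshot, so the two should not be compared directly.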

Latest Results

The rows below were parsed from the Main Results table on the DEEPSYNTH project page. The source also links to a live Hugging Face leaderboard, but this snapshot uses the stable published table.

Rank | Subject | F1 | Model Match | Provenance | Sampled
1 | o3-deep-research | 8.97 | - | Imported | 2026-05-27
2 | GPT-5.2-Pro | 8.70 | GPT-5.2 Pro (openai-gpt-5.2-pro) | Imported | 2026-05-27
3 | Smolagent (GPT-5) | 6.42 | - | Imported | 2026-05-27
4 | Gemini-Pro-2.5 | 6.25 | - | Imported | 2026-05-27
5 | OWL (GPT-4.1) | 5.41 | - | Imported | 2026-05-27
6 | GPT-5.1 | 3.83 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-27
7 | Smolagent (GPT-4.1) | 3.75 | - | Imported | 2026-05-27
8 | GPT-4.1 | 3.46 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-27
9 | o3 | 3.29 | o3 (openai-o3) | Imported | 2026-05-27
10 | DeepSeek-R1-Chat | 3.23 | - | Imported | 2026-05-27
11 | o4-mini | 3.05 | o4 Mini (openai-o4-mini) | Imported | 2026-05-27
12 | DeepSeek-R1-Reasoner | 2.80 | - | Imported | 2026-05-27