BrowseComp
BrowseComp evaluates autonomous agent performance on multi-step web-browsing tasks requiring planning, state tracking, tool use, and recovery.
Rows: 15
Primary metric: Score
Sampled: 2026-05-28
Metadata
Metrics
Score
Showing the 3 latest source slices. Ranks restart at 1 within each slice; the Sampled column identifies the slice.
Slice sampled 2026-05-28 (self-reported):

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Self-reported | 2026-05-28 |
| 2 | GPT-5.5 | 84.4% | GPT-5.5 (`openai-gpt-5.5`) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.8 | 84.3% | Claude Opus 4.8 (`anthropic-claude-opus-4.8`) | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 79.8% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Self-reported | 2026-05-28 |

Slice sampled 2026-04-23 (launch posts):

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 Pro | 90.1% | GPT-5.5 Pro (`openai-gpt-5.5-pro`) | Launch post | 2026-04-23 |
| 2 | GPT-5.4 Pro | 89.3% | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Launch post | 2026-04-23 |
| 3 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Launch post | 2026-04-23 |
| 4 | GPT-5.5 | 84.4% | GPT-5.5 (`openai-gpt-5.5`) | Launch post | 2026-04-23 |
| 5 | GPT-5.4 | 82.7% | GPT-5.4 (`openai-gpt-5.4`) | Launch post | 2026-04-23 |
| 6 | Claude Opus 4.7 | 79.3% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Launch post | 2026-04-23 |

Slice sampled 2026-04-16 (launch posts):

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.4 Pro | 89.3% | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Launch post | 2026-04-16 |
| 2 | Claude Mythos Preview | 86.9% | Claude Mythos Preview (`anthropic-claude-mythos-preview`) | Launch post | 2026-04-16 |
| 3 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Launch post | 2026-04-16 |
| 4 | Claude Opus 4.6 | 83.7% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Launch post | 2026-04-16 |
| 5 | Claude Opus 4.7 | 79.3% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Launch post | 2026-04-16 |
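
The per-slice ranking in the rows above can be reproduced with a short script. This is a minimal sketch, not the site's actual pipeline: it assumes a slice is identified by its Sampled date and that rank is simply descending score within that slice. The `ROWS` list is a hand-copied subset of the table, and `rank_by_slice` is a hypothetical helper name.

```python
from collections import defaultdict

# A subset of rows copied from the table: (subject, score in percent, sampled date).
ROWS = [
    ("Gemini 3.1 Pro Preview", 85.9, "2026-05-28"),
    ("GPT-5.5", 84.4, "2026-05-28"),
    ("Claude Opus 4.8", 84.3, "2026-05-28"),
    ("Claude Opus 4.7", 79.8, "2026-05-28"),
    ("GPT-5.5 Pro", 90.1, "2026-04-23"),
    ("GPT-5.4 Pro", 89.3, "2026-04-23"),
    ("GPT-5.4 Pro", 89.3, "2026-04-16"),
    ("Claude Mythos Preview", 86.9, "2026-04-16"),
]


def rank_by_slice(rows):
    """Group rows by sampled date, then rank each group by descending score.

    Assumption: the sampled date alone identifies a source slice.
    """
    slices = defaultdict(list)
    for subject, score, sampled in rows:
        slices[sampled].append((subject, score))

    ranked = {}
    for sampled, entries in slices.items():
        ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
        # Rank numbering restarts at 1 for every slice.
        ranked[sampled] = [
            (rank, subject, score)
            for rank, (subject, score) in enumerate(ordered, start=1)
        ]
    return ranked


if __name__ == "__main__":
    # Print the newest slice first.
    for sampled, entries in sorted(rank_by_slice(ROWS).items(), reverse=True):
        print(sampled)
        for rank, subject, score in entries:
            print(f"  {rank}. {subject}: {score}%")
```

Because ranks are computed per slice, the same model can hold a different rank in each slice even when its score is unchanged, as Gemini 3.1 Pro Preview does at 85.9%.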