BrowseComp

BrowseComp: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.

15 rows
Primary metric: score
Sampled: 2026-05-28

Showing 3 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Slice sampled 2026-05-28

| Rank | Subject | Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Self-reported | 2026-05-28 |
| 2 | GPT-5.5 | 84.4% | GPT-5.5 (openai-gpt-5.5) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.8 | 84.3% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 79.8% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |

Slice sampled 2026-04-23

| Rank | Subject | Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-5.5 Pro | 90.1% | GPT-5.5 Pro (openai-gpt-5.5-pro) | Launch post | 2026-04-23 |
| 2 | GPT-5.4 Pro | 89.3% | GPT-5.4 Pro (openai-gpt-5.4-pro) | Launch post | 2026-04-23 |
| 3 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Launch post | 2026-04-23 |
| 4 | GPT-5.5 | 84.4% | GPT-5.5 (openai-gpt-5.5) | Launch post | 2026-04-23 |
| 5 | GPT-5.4 | 82.7% | GPT-5.4 (openai-gpt-5.4) | Launch post | 2026-04-23 |
| 6 | Claude Opus 4.7 | 79.3% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Launch post | 2026-04-23 |

Slice sampled 2026-04-16

| Rank | Subject | Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-5.4 Pro | 89.3% | GPT-5.4 Pro (openai-gpt-5.4-pro) | Launch post | 2026-04-16 |
| 2 | Claude Mythos Preview | 86.9% | Claude Mythos Preview (anthropic-claude-mythos-preview) | Launch post | 2026-04-16 |
| 3 | Gemini 3.1 Pro Preview | 85.9% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Launch post | 2026-04-16 |
| 4 | Claude Opus 4.6 | 83.7% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Launch post | 2026-04-16 |
| 5 | Claude Opus 4.7 | 79.3% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Launch post | 2026-04-16 |
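
The rank column restarts at 1 in each source slice, so a rank is only comparable within one sampled date. The following is a minimal sketch of that grouping. It assumes that a slice is identified by its sampled date and that rows are ordered by score, highest first. The page does not state either rule, although both are consistent with the rows shown. `Row`, `ROWS`, and `rank_within_slices` are hypothetical names used for illustration, and the sample data is a subset of the rows above.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    subject: str
    score: float      # percentage as listed on the page
    provenance: str   # "Self-reported" or "Launch post"
    sampled: str      # ISO date of the source slice


def rank_within_slices(rows):
    """Group rows by sampled date and rank each group by score, highest first.

    Assumption: one slice per sampled date. Rows with equal scores keep
    their input order, because Python's sort is stable.
    """
    slices = defaultdict(list)
    for row in rows:
        slices[row.sampled].append(row)

    ranked = {}
    # ISO dates sort correctly as strings; newest slice first.
    for sampled in sorted(slices, reverse=True):
        ordered = sorted(slices[sampled], key=lambda r: r.score, reverse=True)
        ranked[sampled] = list(enumerate(ordered, start=1))
    return ranked


# A subset of the rows listed above, deliberately out of order.
ROWS = [
    Row("Claude Opus 4.8", 84.3, "Self-reported", "2026-05-28"),
    Row("Claude Opus 4.7", 79.3, "Launch post", "2026-04-16"),
    Row("Gemini 3.1 Pro Preview", 85.9, "Self-reported", "2026-05-28"),
    Row("GPT-5.4 Pro", 89.3, "Launch post", "2026-04-16"),
    Row("GPT-5.5", 84.4, "Self-reported", "2026-05-28"),
]

ranked = rank_within_slices(ROWS)
for sampled, entries in ranked.items():
    print(sampled)
    for rank, row in entries:
        print(f"  {rank} {row.subject} {row.score}% ({row.provenance})")
```

Ranking per slice keeps self-reported and launch-post figures from being merged into a single ordering, which matters because the same model can carry different scores in different slices.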