BrowseComp Long Context 128k

A challenging benchmark that evaluates how well web browsing agents can persistently navigate the internet to find hard-to-locate, entangled information. It comprises 1,266 questions that require strategic reasoning, creative search, and interpretation of retrieved content, and whose answers are short and easy to verify.

Rows: 5
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject           Score  Model Match               Provenance     Sampled
1     GPT-5.2           0.92   GPT-5.2 (openai-gpt-5.2)  Self-reported  2026-05-06
2     GPT-5.1           0.90   GPT-5.1 (openai-gpt-5.1)  Self-reported  2026-05-06
2     GPT-5.1 Instant   0.90   GPT-5.1 (openai-gpt-5.1)  Self-reported  2026-05-06
2     GPT-5.1 Thinking  0.90   GPT-5.1 (openai-gpt-5.1)  Self-reported  2026-05-06
2     GPT-5             0.90   GPT-5 (openai-gpt-5)      Self-reported  2026-05-06
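
The Rank column gives tied scores the same rank (one entry at rank 1, four at rank 2). The page does not say how ranks are computed, so the following is only a minimal sketch of one plausible rule, standard competition ranking, applied to the five scores listed above. The function name competition_rank is hypothetical, and the five rows cannot distinguish this rule from dense ranking, since no entry follows the tie.

```python
from typing import List, Tuple


def competition_rank(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Rank rows by descending score; equal scores share the same rank.

    Assumption: standard competition ranking (1, 2, 2, 2, 2, 6, ...).
    The source page does not document its actual ranking rule.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # tie: reuse the previous row's rank
        else:
            rank = position       # new score: rank is the 1-based position
        ranked.append((rank, subject, score))
    return ranked


# Subject and Score values copied from the Latest Results table.
rows = [
    ("GPT-5.2", 0.92),
    ("GPT-5.1", 0.90),
    ("GPT-5.1 Instant", 0.90),
    ("GPT-5.1 Thinking", 0.90),
    ("GPT-5", 0.90),
]

ranked = competition_rank(rows)
for rank, subject, score in ranked:
    print(f"{rank}  {subject:<17} {score:.2f}")
```

Under this assumption the ranks come out as 1, 2, 2, 2, 2, matching the table; a sixth entry with a lower score would receive rank 6 here, whereas dense ranking would give it rank 3.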