BrowseComp Long Context 256k

BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.

2rows
scoreprimary metric
2026-05-06sampled

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank Subject Score Model Match Provenance Sampled
1 GPT-5.2 0.90 GPT-5.2
openai-gpt-5.2
Self-reported 2026-05-06
2 GPT-5 0.89 GPT-5
openai-gpt-5
Self-reported 2026-05-06