BrowseComp Long Context 256k
BrowseComp is a benchmark for measuring the ability of agents to browse the web. It comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, because predicted answers are short and easily verified against reference answers. The benchmark focuses on questions whose answers are obscure, time-invariant, and well supported by evidence scattered across the open web.
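Because each question has a short reference answer, scoring can be illustrated as a simple accuracy computation. The sketch below is hypothetical. It assumes the reported score is the fraction of questions answered correctly and that correctness is judged by normalized exact-string matching. The helper names `normalize`, `is_correct`, and `accuracy` are illustrative only, and the official BrowseComp grader may judge answer equivalence differently.

```python
import string


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())


def is_correct(predicted: str, reference: str) -> bool:
    """Hypothetical check: a prediction counts if it matches the reference after normalization."""
    return normalize(predicted) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    preds = ["The Eiffel Tower", "1887", "blue whale"]
    refs = ["Eiffel Tower", "1887", "Blue Whale."]
    print(accuracy(preds, refs))
```

Here "The Eiffel Tower" does not match "Eiffel Tower" under strict normalized matching, which shows why a real grader might need a more lenient, semantic notion of equivalence.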
Rows: 2
Primary metric: Score
Sampled: 2026-05-06
Metrics: Score, Normalized Score
| Rank | Model | Score | Matched Model (ID) | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.2 | 0.90 | GPT-5.2 (`openai-gpt-5.2`) | Self-reported | 2026-05-06 |
| 2 | GPT-5 | 0.89 | GPT-5 (`openai-gpt-5`) | Self-reported | 2026-05-06 |