AssistantBench

BrowserGym leaderboard slice for AssistantBench web-assistance tasks requiring multi-step information seeking and tool use.

6 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Std. Err. (lower is better)

Latest Results

Rows ranked by highest BrowserGym score. Evaluation protocol metadata is preserved from source JSON files.

Rank  Subject                         Score  Model Match  Provenance  Sampled
1     GenericAgent-GPT-o1-mini        6.90   -            Imported    2026-05-06
2     GenericAgent-Claude-3.5-Sonnet  5.20   -            Imported    2026-05-06
3     GenericAgent-GPT-4o             4.80   -            Imported    2026-05-06
4     GenericAgent-Llama-3.1-405b     3.90   -            Imported    2026-05-06
5     GenericAgent-Llama-3.1-70b      2.80   -            Imported    2026-05-06
6     GenericAgent-GPT-4o-mini        2.10   -            Imported    2026-05-06
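
The ranking rule stated above (rows ordered by highest score) can be reproduced with a short sketch. The `Row` type, its field names, and the `rank_rows` helper are hypothetical and chosen only for illustration; the schema of the source JSON files is not shown on this page, so the rows are entered by hand from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical record shape; the real JSON schema is not shown on the page.
    subject: str
    score: float


# Scores copied from the table above, deliberately listed out of order.
ROWS = [
    Row("GenericAgent-GPT-4o", 4.80),
    Row("GenericAgent-GPT-4o-mini", 2.10),
    Row("GenericAgent-GPT-o1-mini", 6.90),
    Row("GenericAgent-Llama-3.1-70b", 2.80),
    Row("GenericAgent-Claude-3.5-Sonnet", 5.20),
    Row("GenericAgent-Llama-3.1-405b", 3.90),
]


def rank_rows(rows):
    """Return (rank, subject, score) tuples, highest score first."""
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return [(i, r.subject, r.score) for i, r in enumerate(ordered, start=1)]


RANKED = rank_rows(ROWS)

if __name__ == "__main__":
    for rank, subject, score in RANKED:
        print(f"{rank} {subject} {score:.2f}")
```

Because `sorted` is stable, rows with equal scores would keep their input order; the page does not say how ties are broken, so that behavior is an assumption of this sketch.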