WebArena

BrowserGym leaderboard slice for WebArena, evaluating autonomous web agents across realistic browser tasks.

12 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Std. Err. (lower is better)

Latest Results

Rows are ranked by BrowserGym score, highest first. Evaluation protocol metadata is preserved from the source JSON files.

Rank  Subject                         Score  Model Match  Provenance  Sampled
   1  GenericAgent-Claude-3.7-Sonnet  44.60               Imported    2026-05-06
   2  A3-Qwen3.5-9B                   42.10               Imported    2026-05-06
   3  OrbyAgent-Claude-3.5-Sonnet     36.50               Imported    2026-05-06
   4  GenericAgent-Claude-3.5-Sonnet  36.20               Imported    2026-05-06
   5  OrbyAgent-ActIO-72b             34.70               Imported    2026-05-06
   6  GenericAgent-GPT-4o             31.40               Imported    2026-05-06
   7  GenericAgent-GPT-4.1-Mini       30.70               Imported    2026-05-06
   8  GenericAgent-GPT-o1-mini        28.60               Imported    2026-05-06
   9  GenericAgent-Llama-3.1-405b     24.00               Imported    2026-05-06
  10  GenericAgent-AgentTrek-1.0-32b  22.40               Imported    2026-05-06
  11  GenericAgent-Llama-3.1-70b      18.40               Imported    2026-05-06
  12  GenericAgent-GPT-4o-mini        17.40               Imported    2026-05-06
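
The ranking rule stated above (highest score first) can be reproduced with a short sort. This is a minimal sketch, not the leaderboard's actual code: the subject names and scores are copied from the table, while the `rows` list and the `rank_by_score` helper are hypothetical names introduced here for illustration.

```python
# Subjects and scores copied from the results table, listed in arbitrary order.
# `rows` and `rank_by_score` are illustrative names, not part of BrowserGym.
rows = [
    ("GenericAgent-GPT-4o", 31.40),
    ("GenericAgent-Llama-3.1-70b", 18.40),
    ("A3-Qwen3.5-9B", 42.10),
    ("GenericAgent-GPT-4o-mini", 17.40),
    ("OrbyAgent-Claude-3.5-Sonnet", 36.50),
    ("GenericAgent-Claude-3.7-Sonnet", 44.60),
    ("GenericAgent-GPT-o1-mini", 28.60),
    ("GenericAgent-Claude-3.5-Sonnet", 36.20),
    ("GenericAgent-AgentTrek-1.0-32b", 22.40),
    ("OrbyAgent-ActIO-72b", 34.70),
    ("GenericAgent-Llama-3.1-405b", 24.00),
    ("GenericAgent-GPT-4.1-Mini", 30.70),
]


def rank_by_score(rows):
    """Return (rank, subject, score) tuples ordered by score, highest first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_by_score(rows)
for rank, subject, score in ranked:
    # Print one aligned line per row: rank, subject, score to two decimals.
    print(f"{rank:>4}  {subject:<30}  {score:5.2f}")
```

Sorting on the score alone is enough here because no two rows share a score. A real leaderboard would also need a tie-breaking rule, which the source does not specify.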