WebArena
BrowserGym leaderboard slice for WebArena, evaluating autonomous web agents across realistic browser tasks.
Rows: 12
Primary metric: score
Sampled: 2026-05-06
Metrics
Score (higher is better), Std. Err. (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GenericAgent-Claude-3.7-Sonnet | 44.60 | — | Imported | 2026-05-06 |
| 2 | A3-Qwen3.5-9B | 42.10 | — | Imported | 2026-05-06 |
| 3 | OrbyAgent-Claude-3.5-Sonnet | 36.50 | — | Imported | 2026-05-06 |
| 4 | GenericAgent-Claude-3.5-Sonnet | 36.20 | — | Imported | 2026-05-06 |
| 5 | OrbyAgent-ActIO-72b | 34.70 | — | Imported | 2026-05-06 |
| 6 | GenericAgent-GPT-4o | 31.40 | — | Imported | 2026-05-06 |
| 7 | GenericAgent-GPT-4.1-Mini | 30.70 | — | Imported | 2026-05-06 |
| 8 | GenericAgent-GPT-o1-mini | 28.60 | — | Imported | 2026-05-06 |
| 9 | GenericAgent-Llama-3.1-405b | 24.00 | — | Imported | 2026-05-06 |
| 10 | GenericAgent-AgentTrek-1.0-32b | 22.40 | — | Imported | 2026-05-06 |
| 11 | GenericAgent-Llama-3.1-70b | 18.40 | — | Imported | 2026-05-06 |
| 12 | GenericAgent-GPT-4o-mini | 17.40 | — | Imported | 2026-05-06 |