WebChoreArena
Reproducible browser-agent benchmark of 532 tedious web tasks extending WebArena with massive-memory, calculation, and long-term-memory chores.
Rows: 6
Primary metric: overall_success_rate
Sampled: 2026-05-28
Metadata
Metrics
Overall Success Rate, Shopping Success Rate, Admin Success Rate, Reddit Success Rate, GitLab Success Rate, Cross-Site Success Rate
| Rank | Subject | Overall Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | BrowserGym + Gemini 2.5 Pro (preview-03-25) | 44.9% | — | Imported | 2026-05-28 |
| 2 | AgentOccam + Gemini 2.5 Pro (preview-03-25) | 37.8% | — | Imported | 2026-05-28 |
| 3 | AgentOccam + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 23.5% | — | Imported | 2026-05-28 |
| 4 | BrowserGym + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 23.1% | — | Imported | 2026-05-28 |
| 5 | AgentOccam + GPT-4o (2024-05-13) | 6.8% | — | Imported | 2026-05-28 |
| 6 | BrowserGym + GPT-4o (2024-05-13) | 2.6% | — | Imported | 2026-05-28 |
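To make the percentages easier to read, the sketch below converts each overall success rate into an approximate number of solved tasks and confirms that the rank order follows the primary metric. It assumes each rate is computed over all 532 tasks mentioned in the description, which this page does not state explicitly. The names `ROWS`, `TOTAL_TASKS`, `estimated_solved`, and `ranked` are hypothetical and chosen for illustration only.

```python
# Rows copied from the leaderboard table: (subject, overall success rate in %).
ROWS = [
    ("BrowserGym + Gemini 2.5 Pro (preview-03-25)", 44.9),
    ("AgentOccam + Gemini 2.5 Pro (preview-03-25)", 37.8),
    ("AgentOccam + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219)", 23.5),
    ("BrowserGym + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219)", 23.1),
    ("AgentOccam + GPT-4o (2024-05-13)", 6.8),
    ("BrowserGym + GPT-4o (2024-05-13)", 2.6),
]

# Task count taken from the benchmark description.
TOTAL_TASKS = 532


def estimated_solved(rate_percent, total=TOTAL_TASKS):
    """Approximate solved-task count.

    ASSUMPTION: the rate is measured over all `total` tasks. The result is
    an estimate because the published rates are rounded to one decimal.
    """
    return round(rate_percent / 100 * total)


def ranked(rows):
    """Sort rows by overall success rate, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for rank, (subject, rate) in enumerate(ranked(ROWS), start=1):
        solved = estimated_solved(rate)
        print(f"{rank}. {subject}: {rate}% (about {solved} of {TOTAL_TASKS} tasks)")
```

Because the rates are rounded, the task counts are estimates rather than reported figures.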