WebChoreArena

Reproducible browser-agent benchmark of 532 tedious web tasks extending WebArena with massive-memory, calculation, and long-term-memory chores.

6 rows
Primary metric: overall_success_rate
Sampled: 2026-05-28

Metadata

Metrics

Overall Success Rate, Shopping Success Rate, Admin Success Rate, Reddit Success Rate, GitLab Success Rate, Cross-Site Success Rate

Latest Results

Rows are imported from the official WebChoreArena README final-results table.

| Rank | Subject | Overall Success Rate | Model Match | Provenance | Sampled |
|------|---------|----------------------|-------------|------------|---------|
| 1 | BrowserGym + Gemini 2.5 Pro (preview-03-25) | 44.9% | | Imported | 2026-05-28 |
| 2 | AgentOccam + Gemini 2.5 Pro (preview-03-25) | 37.8% | | Imported | 2026-05-28 |
| 3 | AgentOccam + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 23.5% | | Imported | 2026-05-28 |
| 4 | BrowserGym + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 23.1% | | Imported | 2026-05-28 |
| 5 | AgentOccam + GPT-4o (2024-05-13) | 6.8% | | Imported | 2026-05-28 |
| 6 | BrowserGym + GPT-4o (2024-05-13) | 2.6% | | Imported | 2026-05-28 |
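
To make the percentages above more concrete, the following minimal Python sketch re-derives the ranking and converts each overall success rate into an approximate number of solved tasks. It assumes that every configuration was scored on all 532 tasks, which the listing does not state. The names `ROWS`, `TOTAL_TASKS`, `approx_solved`, and `ranked` are hypothetical helpers for illustration and are not part of any WebChoreArena tooling.

```python
# Rows copied from the results listing above: (subject, overall success rate in %).
ROWS = [
    ("BrowserGym + Gemini 2.5 Pro (preview-03-25)", 44.9),
    ("AgentOccam + Gemini 2.5 Pro (preview-03-25)", 37.8),
    ("AgentOccam + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219)", 23.5),
    ("BrowserGym + Claude 3.7 Sonnet (claude-3-7-sonnet-20250219)", 23.1),
    ("AgentOccam + GPT-4o (2024-05-13)", 6.8),
    ("BrowserGym + GPT-4o (2024-05-13)", 2.6),
]

# Task count from the benchmark description.
# ASSUMPTION: each row was evaluated on the full task set.
TOTAL_TASKS = 532


def approx_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the nearest whole number of tasks implied by a percentage."""
    return round(rate_percent / 100 * total)


def ranked(rows):
    """Sort rows by overall success rate, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for rank, (subject, rate) in enumerate(ranked(ROWS), start=1):
        print(f"{rank}. {subject}: {rate}% (about {approx_solved(rate)} tasks)")
```

Because the published rates are rounded to one decimal place, the derived task counts are estimates only and should not be read as reported results.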