AgentRewardBench
Benchmark for evaluating reward models and judge systems on agent trajectories from AssistantBench, VisualWebArena, WebArena, WorkArena, and WorkArena++.
Rows: 16
Primary metric: Overall
Sampled: 2026-05-06
Metadata
Metrics: Overall, Recall, F1, AssistantBench, VisualWebArena, WebArena, WorkArena, WorkArena++
| Rank | Subject | Overall | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Rule-based | 83.80 | — | Imported | 2026-05-06 |
| 2 | WebJudge (o4-mini) | 82.00 | — | Imported | 2026-05-06 |
| 3 | WebJudge-7B | 75.70 | — | Imported | 2026-05-06 |
| 4 | World-State-Model-7B | 71.20 | — | Imported | 2026-05-06 |
| 5 | GPT-4o (A) | 69.80 | — | Imported | 2026-05-06 |
| 6 | Claude 3.7 S. (S) | 69.40 | — | Imported | 2026-05-06 |
| 7 | Claude 3.7 S. (A) | 68.80 | — | Imported | 2026-05-06 |
| 8 | GPT-4o (S) | 68.10 | — | Imported | 2026-05-06 |
| 9 | AER-C (GPT-4o) | 67.70 | — | Imported | 2026-05-06 |
| 10 | Llama 3.3 (A) | 67.70 | — | Imported | 2026-05-06 |
| 11 | AER-V (GPT-4o) | 67.60 | — | Imported | 2026-05-06 |
| 12 | GPT-4o Mini (S) | 64.50 | — | Imported | 2026-05-06 |
| 13 | Qwen2.5-VL (S) | 64.50 | — | Imported | 2026-05-06 |
| 14 | Qwen2.5-VL (A) | 64.30 | — | Imported | 2026-05-06 |
| 15 | GPT-4o Mini (A) | 61.50 | — | Imported | 2026-05-06 |
| 16 | NNetNav (Llama-3.3 70B) | 52.50 | — | Imported | 2026-05-06 |
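
The metrics list above includes Recall and F1. As a rough illustration of how such scores are typically derived for a judge, the sketch below compares a judge's success/failure verdicts with expert labels using the standard binary-classification definitions. The function name `precision_recall_f1` and the sample verdicts are hypothetical, and this page does not state how the benchmark itself aggregates its scores, so treat this as an assumption rather than the benchmark's method.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[bool], expert: Sequence[bool]
) -> Tuple[float, float, float]:
    """Score a judge's success verdicts against expert labels.

    Hypothetical helper: uses the textbook definitions, which may differ
    from the aggregation used by the leaderboard.
    """
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, e in zip(predicted, expert) if p and e)
    fp = sum(1 for p, e in zip(predicted, expert) if p and not e)
    fn = sum(1 for p, e in zip(predicted, expert) if not p and e)

    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up verdicts for five trajectories (True = judged successful).
judge_verdicts = [True, True, False, True, False]
expert_labels = [True, False, False, True, True]

precision, recall, f1 = precision_recall_f1(judge_verdicts, expert_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up sample the judge approves three trajectories, two of which the expert also marks as successful, and it misses one successful trajectory, so precision and recall both come out at two thirds.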