AgentRewardBench

Benchmark for evaluating reward models and judge systems on agent trajectories from AssistantBench, VisualWebArena, WebArena, WorkArena, and WorkArena++.

16 rows
Primary metric: Overall
Sampled: 2026-05-06

Metadata

Metrics

Overall, Recall, F1, AssistantBench, VisualWebArena, WebArena, WorkArena, WorkArena++

Latest Results

Rows are ranked by Overall score, matching the sort order used in the Space app. Project and log URLs are retained in the metadata.

Rank  Subject                  Overall  Model Match  Provenance  Sampled
1     Rule-based               83.80                 Imported    2026-05-06
2     WebJudge (o4-mini)       82.00                 Imported    2026-05-06
3     WebJudge-7B              75.70                 Imported    2026-05-06
4     World-State-Model-7B     71.20                 Imported    2026-05-06
5     GPT-4o (A)               69.80                 Imported    2026-05-06
6     Claude 3.7 S. (S)        69.40                 Imported    2026-05-06
7     Claude 3.7 S. (A)        68.80                 Imported    2026-05-06
8     GPT-4o (S)               68.10                 Imported    2026-05-06
9     AER-C (GPT-4o)           67.70                 Imported    2026-05-06
10    Llama 3.3 (A)            67.70                 Imported    2026-05-06
11    AER-V (GPT-4o)           67.60                 Imported    2026-05-06
12    GPT-4o Mini (S)          64.50                 Imported    2026-05-06
13    Qwen2.5-VL (S)           64.50                 Imported    2026-05-06
14    Qwen2.5-VL (A)           64.30                 Imported    2026-05-06
15    GPT-4o Mini (A)          61.50                 Imported    2026-05-06
16    NNetNav (Llama-3.3 70B)  52.50                 Imported    2026-05-06
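
The ranking above is a descending sort on the Overall score. As a minimal sketch, assuming rows are available as (subject, overall) pairs, it could be reproduced as follows. The function name `rank_by_overall` and the row layout are hypothetical, not taken from the Space app, and the tie-break rule for equal scores (for example the two entries at 67.70) is an assumption: this sketch simply keeps input order.

```python
# Illustrative subset of rows copied from the leaderboard: (subject, overall).
rows = [
    ("GPT-4o (A)", 69.80),
    ("Rule-based", 83.80),
    ("AER-C (GPT-4o)", 67.70),
    ("WebJudge (o4-mini)", 82.00),
    ("Llama 3.3 (A)", 67.70),
]


def rank_by_overall(rows):
    """Return (rank, subject, overall) tuples sorted by Overall, highest first.

    sorted() is stable, so rows with equal scores keep their input order.
    That tie-break is an assumption; the leaderboard's own rule is not stated.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, overall)
        for rank, (subject, overall) in enumerate(ordered, start=1)
    ]


ranked = rank_by_overall(rows)
for rank, subject, overall in ranked:
    print(f"{rank:>2}  {subject:<20} {overall:.2f}")
```

Ranks here are assigned sequentially, as in the table, so tied scores still receive distinct rank numbers.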