Vending-Bench 2

Long-horizon autonomous agent benchmark measuring how well models operate a simulated vending-machine business over an extended period.

Rows: 45
Primary metric: final_account_value
Sampled: 2026-05-28

Metadata

Metrics

Final Account Value, Profit/Loss, Initial Account Value, Epochs, Observations, Minimum Account Value, Maximum Account Value

Showing 2 latest source slices.

Latest Results

Rows ranked by highest final account value.

| Rank | Subject | Final Account Value | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 10936.76 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28 |
| 2 | Claude Opus 4.6 | 8017.59 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28 |
| 3 | GPT-5.5 | 7523.84 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 7204.14 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28 |
| 5 | Kimi K2.6 | 6204.57 | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28 |
| 6 | GPT-5.4 | 6144.18 | | Imported | 2026-05-28 |
| 7 | GPT-5.3-Codex | 5940.12 | GPT-5.3-Codex (openai-gpt-5.3-codex) | Imported | 2026-05-28 |
| 8 | Claude Opus 4.8 - High | 5787.43 | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28 |
| 9 | GLM-5.1 | 5634.41 | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28 |
| 10 | Gemini 3 Pro | 5478.16 | Gemini 3 (google-gemini-3) | Imported | 2026-05-28 |
| 11 | Gemini 3.5 Flash | 5396.42 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28 |
| 12 | Qwen 3.6 Plus | 5114.87 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28 |
| 13 | Claude Opus 4.5 | 4967.06 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28 |
| 14 | Grok 4.20 | 4662.85 | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28 |
| 15 | GLM-5 | 4432.12 | GLM 5 (z-ai-glm-5) | Imported | 2026-05-28 |
| 16 | Qwen 3.6 Max | 4254.19 | | Imported | 2026-05-28 |
| 17 | Claude Sonnet 4.5 | 3838.74 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28 |
| 18 | Gemini 3.1 Pro Custom Tools | 3774.25 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28 |
| 19 | Gemini 3 Flash | 3634.72 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28 |
| 20 | GPT-5.2 | 3591.33 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28 |
| 21 | Deepseek V4 Pro | 3284.52 | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28 |
| 22 | Claude Opus 4.8 - Max | 2992.34 | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28 |
| 23 | GLM-4.7 | 2376.82 | GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-28 |
| 24 | GPT-5.1 | 1473.43 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28 |
| 25 | Kimi K2.5 | 1198.46 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28 |
| 26 | Grok 4.1 Fast | 1106.63 | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28 |
| 27 | DeepSeek-V3.2 | 1034.00 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-28 |
| 28 | Gemini 3.1 Pro | 911.21 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28 |
| 29 | Gemini 2.5 Pro | 573.64 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28 |
| 30 | Gemini 2.5 Flash | 548.84 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-28 |
| 31 | Qwen 3.5 35B A3B | 462.69 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Imported | 2026-05-28 |
| 32 | Claude Haiku 4.5 | 458.89 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28 |
| 33 | Qwen 3.5 27B | 201.98 | Qwen3.5-27B (qwen-qwen3.5-27b) | Imported | 2026-05-28 |
| 34 | MiniMax-M2 | 160.60 | MiniMax M2 (minimax-minimax-m2) | Imported | 2026-05-28 |
| 35 | Qwen3 Max | 71.57 | Qwen3 Max (qwen-qwen3-max) | Imported | 2026-05-28 |
| 36 | Grok 4.3 | 35.26 | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28 |
| 37 | Qwen 3.5 Plus | 0.54 | Qwen3.5 Plus 2026-04-20 (qwen-qwen3.5-plus-20260420) | Imported | 2026-05-28 |
| 38 | Qwen3 235B A22B Thinking | -11.34 | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Imported | 2026-05-28 |
| 39 | GPT-OSS-120b | -21.53 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28 |
| 40 | MiniMax-M2.5 | -23.16 | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-28 |
| 41 | GPT-5 mini | -31.18 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28 |

| Rank | Subject | Final Account Value | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (max effort) | 10937 USD | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 (high effort) | 7971 USD | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.8 (high effort) | 5787.4 USD | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.8 (max effort) | 2992.3 USD | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
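
The ranking rule stated above (rows ordered by highest final account value) and the overlap between the imported and self-reported rows can be illustrated with a minimal Python sketch. The `Row` class, the `rank_rows` and `agrees` helpers, and the 0.5 tolerance are hypothetical names and choices made for this illustration, not part of the benchmark or of this page's tooling. The pairing of imported rows with self-reported rows is also an assumption, based only on how close the values are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str
    final_account_value: float


# A few rows copied from the imported results above (not the full table),
# deliberately listed out of order.
imported = [
    Row("Claude Opus 4.8 - Max", 2992.34),
    Row("Claude Opus 4.7", 10936.76),
    Row("Claude Opus 4.8 - High", 5787.43),
    Row("GPT-5.5", 7523.84),
]


def rank_rows(rows):
    """Return (rank, row) pairs, highest final account value first."""
    ordered = sorted(rows, key=lambda r: r.final_account_value, reverse=True)
    return list(enumerate(ordered, start=1))


# Self-reported values, keyed by the imported subject they are ASSUMED to
# correspond to. The page does not state this mapping explicitly.
self_reported = {
    "Claude Opus 4.7": 10937.0,        # listed as "max effort"
    "Claude Opus 4.8 - High": 5787.4,  # listed as "high effort"
    "Claude Opus 4.8 - Max": 2992.3,   # listed as "max effort"
}


def agrees(imported_value, reported_value, tolerance=0.5):
    """True if two values differ by no more than the rounding tolerance."""
    return abs(imported_value - reported_value) <= tolerance


ranking = rank_rows(imported)
for rank, row in ranking:
    print(rank, row.subject, row.final_account_value)

by_subject = {row.subject: row.final_account_value for row in imported}
consistent = {
    subject: agrees(by_subject[subject], value)
    for subject, value in self_reported.items()
}
print(consistent)
```

Under this assumed pairing, the three self-reported values look like rounded versions of the imported ones. The self-reported "Claude Opus 4.7 (high effort)" value of 7971 USD has no counterpart among the imported rows.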