Vending-Bench 2
Long-horizon autonomous agent benchmark measuring how well models operate a simulated vending-machine business over an extended period.
45 rows
Primary metric: final_account_value
Sampled: 2026-05-28
Metadata
Metrics
Final Account Value, Profit/Loss, Initial Account Value, Epochs, Observations, Minimum Account Value, Maximum Account Value
Showing 2 latest source slices.
| Rank | Subject | Final Account Value | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 10936.76 | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-28 |
| 2 | Claude Opus 4.6 | 8017.59 | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-28 |
| 3 | GPT-5.5 | 7523.84 | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 7204.14 | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-28 |
| 5 | Kimi K2.6 | 6204.57 | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-28 |
| 6 | GPT-5.4 | 6144.18 | — | Imported | 2026-05-28 |
| 7 | GPT-5.3-Codex | 5940.12 | GPT-5.3-Codex openai-gpt-5.3-codex | Imported | 2026-05-28 |
| 8 | Claude Opus 4.8 - High | 5787.43 | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 9 | GLM-5.1 | 5634.41 | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-28 |
| 10 | Gemini 3 Pro | 5478.16 | Gemini 3 google-gemini-3 | Imported | 2026-05-28 |
| 11 | Gemini 3.5 Flash | 5396.42 | Gemini 3.5 Flash google-gemini-3.5-flash | Imported | 2026-05-28 |
| 12 | Qwen 3.6 Plus | 5114.87 | Qwen3.6 Plus qwen-qwen3.6-plus | Imported | 2026-05-28 |
| 13 | Claude Opus 4.5 | 4967.06 | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-28 |
| 14 | Grok 4.20 | 4662.85 | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-28 |
| 15 | GLM-5 | 4432.12 | GLM 5 z-ai-glm-5 | Imported | 2026-05-28 |
| 16 | Qwen 3.6 Max | 4254.19 | — | Imported | 2026-05-28 |
| 17 | Claude Sonnet 4.5 | 3838.74 | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-28 |
| 18 | Gemini 3.1 Pro Custom Tools | 3774.25 | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 19 | Gemini 3 Flash | 3634.72 | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-28 |
| 20 | GPT-5.2 | 3591.33 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-28 |
| 21 | Deepseek V4 Pro | 3284.52 | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Imported | 2026-05-28 |
| 22 | Claude Opus 4.8 - Max | 2992.34 | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 23 | GLM-4.7 | 2376.82 | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-28 |
| 24 | GPT-5.1 | 1473.43 | GPT-5.1 openai-gpt-5.1 | Imported | 2026-05-28 |
| 25 | Kimi K2.5 | 1198.46 | MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5 | Imported | 2026-05-28 |
| 26 | Grok 4.1 Fast | 1106.63 | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-28 |
| 27 | DeepSeek-V3.2 | 1034.00 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-28 |
| 28 | Gemini 3.1 Pro | 911.21 | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 29 | Gemini 2.5 Pro | 573.64 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-28 |
| 30 | Gemini 2.5 Flash | 548.84 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-28 |
| 31 | Qwen 3.5 35B A3B | 462.69 | Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b | Imported | 2026-05-28 |
| 32 | Claude Haiku 4.5 | 458.89 | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-28 |
| 33 | Qwen 3.5 27B | 201.98 | Qwen3.5-27B qwen-qwen3.5-27b | Imported | 2026-05-28 |
| 34 | MiniMax-M2 | 160.60 | MiniMax M2 minimax-minimax-m2 | Imported | 2026-05-28 |
| 35 | Qwen3 Max | 71.57 | Qwen3 Max qwen-qwen3-max | Imported | 2026-05-28 |
| 36 | Grok 4.3 | 35.26 | Grok 4.3 x-ai-grok-4.3 | Imported | 2026-05-28 |
| 37 | Qwen 3.5 Plus | 0.54 | Qwen3.5 Plus 2026-04-20 qwen-qwen3.5-plus-20260420 | Imported | 2026-05-28 |
| 38 | Qwen3 235B A22B Thinking | -11.34 | Qwen3 235B A22B Thinking 2507 qwen-qwen3-235b-a22b-thinking-2507 | Imported | 2026-05-28 |
| 39 | GPT-OSS-120b | -21.53 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-28 |
| 40 | MiniMax-M2.5 | -23.16 | MiniMax M2.5 minimax-minimax-m2.5 | Imported | 2026-05-28 |
| 41 | GPT-5 mini | -31.18 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-28 |

Second source slice (self-reported):

| Rank | Subject | Final Account Value | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (max effort) | 10937 USD | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 (high effort) | 7971 USD | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.8 (high effort) | 5787.4 USD | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.8 (max effort) | 2992.3 USD | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
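
The two slices format the primary metric differently: imported rows give a bare number, while self-reported rows append `USD`. To compare rows across slices, the values can be normalized first. The following is a minimal sketch, not part of the leaderboard's own tooling. The helper names `parse_row` and `parse_rows` and the dictionary keys are hypothetical, and the three sample rows are copied from the tables above.

```python
import re

# Three rows copied verbatim from the tables above (both provenance types).
ROWS = """\
| 3 | GPT-5.5 | 7523.84 | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 2 | Claude Opus 4.7 (high effort) | 7971 USD | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 41 | GPT-5 mini | -31.18 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-28 |
"""

# Matches a signed decimal number with an optional trailing "USD" unit.
VALUE_RE = re.compile(r"(-?\d+(?:\.\d+)?)(?:\s*USD)?")


def parse_row(line):
    """Parse one table row into a dict (hypothetical helper)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}: {line!r}")
    _rank, subject, value, _match, provenance, sampled = cells
    m = VALUE_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"unrecognized account value: {value!r}")
    return {
        "subject": subject,
        "final_account_value": float(m.group(1)),  # unit suffix dropped
        "provenance": provenance,
        "sampled": sampled,
    }


def parse_rows(text):
    """Parse every non-empty line of a block of table rows."""
    return [parse_row(line) for line in text.splitlines() if line.strip()]


rows = parse_rows(ROWS)
# Re-rank across slices by the primary metric, highest value first.
ranked = sorted(rows, key=lambda r: r["final_account_value"], reverse=True)
for r in ranked:
    print(f'{r["subject"]}: {r["final_account_value"]:.2f} ({r["provenance"]})')
```

Because the sketch ignores the per-slice rank column and sorts on the parsed value, rows from both slices land in a single ordering. Whether imported and self-reported figures are directly comparable is a separate question that the page does not answer.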