VibeCodingBench
Production-oriented coding benchmark evaluating AI coding agents across functional correctness, visual fidelity, code quality, security, cost, and speed on representative developer tasks.
Rows: 15
Primary metric: avg_score
Sampled: 2026-05-06
Metadata
Metrics
Avg Score, Pass Rate, Tasks Completed, Functional, Visual, Quality, Security, Cost Score, Speed Score, Total Cost (lower is better), Avg Time (lower is better), Total Tokens (lower is better)
| Rank | Subject | Avg Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 89.15 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 2 | Claude Haiku 4.5 | 88.97 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-06 |
| 3 | Grok 4 Fast | 88.80 | Grok 4 Fast (`x-ai-grok-4-fast`) | Imported | 2026-05-06 |
| 4 | OpenAI GPT-5.2 | 88.75 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-06 |
| 5 | Qwen3 Max | 88.60 | Qwen3 Max (`qwen-qwen3-max`) | Imported | 2026-05-06 |
| 6 | Claude Sonnet 4.5 | 88.56 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 7 | GLM 4-Plus | 88.20 | — | Imported | 2026-05-06 |
| 8 | DeepSeek v3.2 | 88.19 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-06 |
| 9 | Grok 4 | 88 | Grok 4 (`x-ai-grok-4`) | Imported | 2026-05-06 |
| 10 | MiniMax M2.1 | 87.42 | MiniMax M2.1 (`minimax-minimax-m2.1`) | Imported | 2026-05-06 |
| 11 | Grok 4.1 Fast | 86.80 | Grok 4.1 Fast (`x-ai-grok-4.1-fast`) | Imported | 2026-05-06 |
| 12 | Gemini 3 Pro Preview | 85.80 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-06 |
| 13 | GLM-4.7 | 83.90 | GLM 4.7 (`z-ai-glm-4.7`) | Imported | 2026-05-06 |
| 14 | GLM 4.7 Flash | 83.83 | GLM 4.7 Flash (`z-ai-glm-4.7-flash`) | Imported | 2026-05-06 |
| 15 | Gemini 3 Flash | 83.44 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-06 |
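
For readers who want to work with these rows programmatically, here is a minimal Python sketch. It only re-derives the ranking from the Avg Score column and reports the gap between the top and bottom entries. The helper names `rank_by_avg_score` and `score_spread` are hypothetical and not part of the benchmark's tooling. The scores are copied from the table above, and nothing here reproduces how VibeCodingBench itself computes `avg_score`.

```python
# (subject, avg_score) pairs copied from the leaderboard table above.
ROWS = [
    ("Claude Opus 4.5", 89.15),
    ("Claude Haiku 4.5", 88.97),
    ("Grok 4 Fast", 88.80),
    ("OpenAI GPT-5.2", 88.75),
    ("Qwen3 Max", 88.60),
    ("Claude Sonnet 4.5", 88.56),
    ("GLM 4-Plus", 88.20),
    ("DeepSeek v3.2", 88.19),
    ("Grok 4", 88.0),
    ("MiniMax M2.1", 87.42),
    ("Grok 4.1 Fast", 86.80),
    ("Gemini 3 Pro Preview", 85.80),
    ("GLM-4.7", 83.90),
    ("GLM 4.7 Flash", 83.83),
    ("Gemini 3 Flash", 83.44),
]


def rank_by_avg_score(rows):
    """Return (rank, subject, score) tuples, highest avg_score first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i + 1, name, score) for i, (name, score) in enumerate(ordered)]


def score_spread(rows):
    """Difference between the highest and lowest avg_score, to 2 decimals."""
    scores = [score for _, score in rows]
    return round(max(scores) - min(scores), 2)


if __name__ == "__main__":
    for rank, name, score in rank_by_avg_score(ROWS):
        print(f"{rank:>2}  {name:<22} {score:.2f}")
    print("spread:", score_spread(ROWS))
```

The top nine entries all fall within about 1.2 points of each other on the primary metric, so small rank differences in that range are worth reading with some caution.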