ABC-Bench
Agentic backend coding benchmark evaluating whether coding agents can explore real repositories, edit code, configure environments, deploy containerized services, and pass external HTTP integration tests.
11 rows
Primary metric: overall_pass_at_1
Sampled: 2026-05-27
Metadata
Metrics
Overall pass@1, Overall pass@1 CI (lower is better), Python pass@1, Go pass@1, JavaScript pass@1, Java pass@1, Ruby pass@1, C# pass@1, PHP pass@1, Rust pass@1
| Rank | Subject | Overall pass@1 (+/- CI) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 63.2% +/- 1.9 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-27 |
| 2 | DeepSeek-V3.2 | 50.1% +/- 1.9 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-27 |
| 3 | GPT-5 | 49.4% +/- 1.9 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-27 |
| 4 | Qwen3-Coder-480B-A35B | 43.1% +/- 1.9 | Qwen3 Coder 480B A35B (`qwen-qwen3-coder`) | Imported | 2026-05-27 |
| 5 | Nex-N1-671B | 42.1% +/- 1.9 | — | Imported | 2026-05-27 |
| 6 | GLM 4.7 | 40.1% +/- 1.9 | GLM 4.7 (`z-ai-glm-4.7`) | Imported | 2026-05-27 |
| 7 | Nex-N1-32B | 34.5% +/- 1.8 | — | Imported | 2026-05-27 |
| 8 | Qwen3-Coder-30B-A3B | 28.6% +/- 1.7 | — | Imported | 2026-05-27 |
| 9 | Gemini 2.5 Pro | 25.0% +/- 1.7 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-27 |
| 10 | Qwen3-32B | 8.9% +/- 1.1 | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-27 |
| 11 | Qwen3-8B | 8.3% +/- 1.1 | Qwen3 8B (`qwen-qwen3-8b`) | Imported | 2026-05-27 |
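
Several adjacent ranks sit closer together than their reported margins. The sketch below is one rough way to flag such pairs. It assumes the "+/-" value is a symmetric interval half-width in percentage points, which the page does not state explicitly. The helper names `intervals_overlap` and `overlapping_neighbors` are hypothetical, and overlapping intervals are only a heuristic, not a significance test.

```python
# Scores and "+/-" values copied from the table above (percentage points).
# Assumption: each "+/-" value is a symmetric half-width around the score.
ROWS = [
    ("Claude Sonnet 4.5", 63.2, 1.9),
    ("DeepSeek-V3.2", 50.1, 1.9),
    ("GPT-5", 49.4, 1.9),
    ("Qwen3-Coder-480B-A35B", 43.1, 1.9),
    ("Nex-N1-671B", 42.1, 1.9),
    ("GLM 4.7", 40.1, 1.9),
    ("Nex-N1-32B", 34.5, 1.8),
    ("Qwen3-Coder-30B-A3B", 28.6, 1.7),
    ("Gemini 2.5 Pro", 25.0, 1.7),
    ("Qwen3-32B", 8.9, 1.1),
    ("Qwen3-8B", 8.3, 1.1),
]


def intervals_overlap(a, b):
    """Return True if the [score - hw, score + hw] ranges of two rows intersect."""
    _, score_a, half_a = a
    _, score_b, half_b = b
    return abs(score_a - score_b) <= half_a + half_b


def overlapping_neighbors(rows):
    """List (higher, lower) name pairs for adjacent ranks whose ranges intersect."""
    return [
        (upper[0], lower[0])
        for upper, lower in zip(rows, rows[1:])
        if intervals_overlap(upper, lower)
    ]


if __name__ == "__main__":
    for upper, lower in overlapping_neighbors(ROWS):
        print(f"{upper} and {lower}: ranges overlap")
```

Under that assumption, the ordering within a flagged pair (for example ranks 2 and 3) is best read as tentative, while the gap between rank 1 and rank 2 is much larger than the two margins combined.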