ABC-Bench

Agentic backend coding benchmark evaluating whether coding agents can explore real repositories, edit code, configure environments, deploy containerized services, and pass external HTTP integration tests.

Rows: 11
Primary metric: overall_pass_at_1
Sampled: 2026-05-27

Metrics

Overall pass@1, Overall pass@1 CI (lower is better), Python pass@1, Go pass@1, JavaScript pass@1, Java pass@1, Ruby pass@1, C# pass@1, PHP pass@1, Rust pass@1
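As a rough illustration of how a pass@1 score and its "+/-" margin could be derived from raw attempt outcomes, here is a minimal sketch. The source does not state how the interval is computed, so the normal-approximation formula, the multiplier `z`, the function names, and the sample data are all assumptions made for illustration.

```python
import math


def pass_at_1(outcomes):
    """Return the fraction of single attempts that passed.

    `outcomes` is a sequence of booleans, one per (task, attempt) pair.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def normal_half_width(p, n, z=1.0):
    """Half-width of a normal-approximation interval: z * sqrt(p * (1 - p) / n).

    z=1.0 gives one standard error; z=1.96 would give roughly a 95% interval.
    Which convention the leaderboard uses is not stated in the source.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical data: 10 attempts, 6 of which passed.
outcomes = [True] * 6 + [False] * 4
p = pass_at_1(outcomes)
hw = normal_half_width(p, len(outcomes))
print(f"pass@1 = {100 * p:.1f}% +/- {100 * hw:.1f}")
```

Under this approximation the margin shrinks with the square root of the number of attempts, which is consistent with reading a narrower interval as better.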

Latest Results

Rows are transcribed from Table 2 of the public ABC-Bench arXiv paper. The primary score is the overall average pass@1.

Rank | Subject | Overall pass@1 | Model Match | Provenance | Sampled
1 | Claude Sonnet 4.5 | 63.2% +/- 1.9 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-27
2 | DeepSeek-V3.2 | 50.1% +/- 1.9 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-27
3 | GPT-5 | 49.4% +/- 1.9 | GPT-5 (openai-gpt-5) | Imported | 2026-05-27
4 | Qwen3-Coder-480B-A35B | 43.1% +/- 1.9 | Qwen3 Coder 480B A35B (qwen-qwen3-coder) | Imported | 2026-05-27
5 | Nex-N1-671B | 42.1% +/- 1.9 | - | Imported | 2026-05-27
6 | GLM 4.7 | 40.1% +/- 1.9 | GLM GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-27
7 | Nex-N1-32B | 34.5% +/- 1.8 | - | Imported | 2026-05-27
8 | Qwen3-Coder-30B-A3B | 28.6% +/- 1.7 | - | Imported | 2026-05-27
9 | Gemini 2.5 Pro | 25.0% +/- 1.7 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-27
10 | Qwen3-32B | 8.9% +/- 1.1 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-27
11 | Qwen3-8B | 8.3% +/- 1.1 | Qwen3 8B (qwen-qwen3-8b) | Imported | 2026-05-27