DevBench

Software-development lifecycle benchmark covering environment setup, implementation, acceptance testing, unit testing, and software design.

9 rows
benchmarklist_metric_mean (primary metric)
2026-05-27 (sampled)

Metadata

Metrics

Implementation Pass@ Acceptance Test, Implementation Pass@ Unit Test, Unit Testing Coverage, Software Design General Principles w/o Tie, Software Design Faithfulness w/o Tie

Latest Results

Rows are parsed from the coding-task and software-design result tables in the DevBench README. The primary score is a BenchmarkList aggregate: the mean of each subject's finite published metrics.
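As a rough illustration of what "mean across finite published metrics" could look like, the sketch below averages only the values that are present and finite. The function name `aggregate_mean` and the sample values are hypothetical. The exact BenchmarkList computation, including any weighting or rescaling, is not stated here, so an unweighted mean is an assumption.

```python
import math
from typing import Iterable, Optional


def aggregate_mean(values: Iterable[Optional[float]]) -> Optional[float]:
    """Unweighted mean of the finite entries; None if there are none.

    Hypothetical helper: missing (None), NaN, and infinite values are
    skipped rather than counted as zero.
    """
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        return None
    return sum(finite) / len(finite)


# Made-up metric values for illustration only (not DevBench results).
example_metrics = [40.0, 20.0, float("nan"), None, 30.0]
print(aggregate_mean(example_metrics))  # 30.0
```

Skipping unpublished metrics instead of treating them as zero means subjects with fewer reported metrics are averaged over fewer values, so scores may not be strictly comparable across rows.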

Rank  Subject                       Score    Model Match                           Provenance  Sampled
1     GPT-4-Turbo-0125              56.1636  GPT-4 Turbo (openai-gpt-4-turbo)      Imported    2026-05-27
2     GPT-4-Turbo-1106              54.1273  GPT-4 Turbo (openai-gpt-4-turbo)      Imported    2026-05-27
3     DeepSeek-Coder-33B-Instruct   35.8182  -                                     Imported    2026-05-27
4     DeepSeek-Coder-6.7B-Instruct  27.8545  -                                     Imported    2026-05-27
5     CodeLlama-34B-Instruct        24.4545  -                                     Imported    2026-05-27
6     GPT-3.5-Turbo                 24.0286  GPT-3.5 Turbo (openai-gpt-3.5-turbo)  Imported    2026-05-27
7     CodeLlama-13B-Instruct        13.2818  -                                     Imported    2026-05-27
8     CodeLlama-7B-Instruct         9.7364   -                                     Imported    2026-05-27
9     DeepSeek-Coder-1.3B-Instruct  8.0182   -                                     Imported    2026-05-27