DevBench
Software-development lifecycle benchmark covering environment setup, implementation, acceptance testing, unit testing, and software design.
9 rows
Primary metric: benchmarklist_metric_mean
Sampled: 2026-05-27
Metadata
Metrics
Implementation Pass@ Acceptance Test, Implementation Pass@ Unit Test, Unit Testing Coverage, Software Design General Principles w/o Tie, Software Design Faithfulness w/o Tie
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4-Turbo-0125 | 56.1636 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-27 |
| 2 | GPT-4-Turbo-1106 | 54.1273 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-27 |
| 3 | DeepSeek-Coder-33B-Instruct | 35.8182 | — | Imported | 2026-05-27 |
| 4 | DeepSeek-Coder-6.7B-Instruct | 27.8545 | — | Imported | 2026-05-27 |
| 5 | CodeLlama-34B-Instruct | 24.4545 | — | Imported | 2026-05-27 |
| 6 | GPT-3.5-Turbo | 24.0286 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 7 | CodeLlama-13B-Instruct | 13.2818 | — | Imported | 2026-05-27 |
| 8 | CodeLlama-7B-Instruct | 9.7364 | — | Imported | 2026-05-27 |
| 9 | DeepSeek-Coder-1.3B-Instruct | 8.0182 | — | Imported | 2026-05-27 |
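
The Rank column in the table above appears to follow descending Score. The following is a minimal sketch of that ordering, assuming ranks are assigned purely by score with no tie-breaking rule (the page does not state one). The names `ROWS` and `rank_by_score` are illustrative and not part of DevBench or the leaderboard site; the scores are copied from the table.

```python
# (subject, score) pairs copied from the leaderboard table.
ROWS = [
    ("GPT-4-Turbo-0125", 56.1636),
    ("GPT-4-Turbo-1106", 54.1273),
    ("DeepSeek-Coder-33B-Instruct", 35.8182),
    ("DeepSeek-Coder-6.7B-Instruct", 27.8545),
    ("CodeLlama-34B-Instruct", 24.4545),
    ("GPT-3.5-Turbo", 24.0286),
    ("CodeLlama-13B-Instruct", 13.2818),
    ("CodeLlama-7B-Instruct", 9.7364),
    ("DeepSeek-Coder-1.3B-Instruct", 8.0182),
]


def rank_by_score(rows):
    """Return (rank, subject, score) tuples, highest score first.

    Assumption: rank is determined by score alone, with rank 1 for the
    highest score. Ties are not handled specially.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, subject, score in rank_by_score(ROWS):
        print(f"{rank}. {subject}: {score:.4f}")
```

Under this assumption the computed ranks match the Rank column for all nine rows.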