ProgramBench
A benchmark where software engineering agents rebuild complete programs from compiled binaries and documentation, then are scored against hidden behavioral tests.
Rows: 9
Primary metric: resolved_percent
Sampled: 2026-05-05
Metadata
Metrics: Resolved, Almost Resolved, Avg. Cost (lower is better), Avg. Calls (lower is better)
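As a rough illustration of how the four metrics listed above could be aggregated from per-task results, here is a minimal Python sketch. It is not the benchmark's actual scoring code. The `TaskResult` record, the `summarize` function, the rule that a task counts as resolved only when every hidden test passes, and the 95% cutoff for "almost resolved" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    tests_passed: int
    tests_total: int
    cost_usd: float
    api_calls: int


def summarize(results, almost_threshold=0.95):
    """Aggregate per-task results into leaderboard-style metrics.

    Assumptions (not taken from the benchmark itself):
      * resolved        = every hidden test passes
      * almost resolved = pass rate >= almost_threshold (includes resolved)
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")

    def pass_rate(r):
        # Treat a task with no tests as a pass rate of zero.
        return r.tests_passed / r.tests_total if r.tests_total > 0 else 0.0

    resolved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    almost = sum(1 for r in results if pass_rate(r) >= almost_threshold)

    return {
        "resolved_percent": 100.0 * resolved / n,
        "almost_resolved_percent": 100.0 * almost / n,
        "avg_cost": sum(r.cost_usd for r in results) / n,
        "avg_calls": sum(r.api_calls for r in results) / n,
    }


# Made-up example data, purely to exercise the function.
example = [
    TaskResult(tests_passed=10, tests_total=10, cost_usd=2.0, api_calls=40),
    TaskResult(tests_passed=39, tests_total=40, cost_usd=4.0, api_calls=60),
    TaskResult(tests_passed=5, tests_total=10, cost_usd=6.0, api_calls=80),
    TaskResult(tests_passed=0, tests_total=10, cost_usd=8.0, api_calls=20),
]
summary = summarize(example)
```

Under an all-or-nothing rule like the one assumed here, a submission that passes most but not all hidden tests still contributes nothing to `resolved_percent`. That would be consistent with every row showing 0% in the Resolved column, though the table alone does not confirm the rule.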
| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 + mini-SWE-agent | 0% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-05 |
| 2 | Claude Opus 4.6 + mini-SWE-agent | 0% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-05 |
| 3 | Claude Sonnet 4.6 + mini-SWE-agent | 0% | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-05 |
| 4 | GPT-5.4 + mini-SWE-agent | 0% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-05 |
| 5 | Gemini 3.1 Pro + mini-SWE-agent | 0% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-05 |
| 6 | Gemini 3 Flash + mini-SWE-agent | 0% | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-05 |
| 7 | Claude Haiku 4.5 + mini-SWE-agent | 0% | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-05 |
| 8 | GPT-5.4 Mini + mini-SWE-agent | 0% | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Imported | 2026-05-05 |
| 9 | GPT-5 Mini + mini-SWE-agent | 0% | GPT-5 Mini (`openai-gpt-5-mini`) | Imported | 2026-05-05 |