ProgramBench

A benchmark in which software engineering agents rebuild complete programs from compiled binaries and documentation and are then scored against hidden behavioral tests.

Rows: 9
Primary metric: resolved_percent
Sampled: 2026-05-05

Metadata

Metrics

Resolved, Almost Resolved, Avg. Cost (lower is better), Avg. Calls (lower is better)

Latest Results

Evaluated with mini-SWE-agent on 200 tasks. Resolved is the primary metric; almost resolved counts instances where at least 95% of behavioral tests pass.
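The two headline metrics can be sketched from per-task test counts. This is a minimal illustration, not the benchmark's actual scoring code: the `TaskResult` type, the `summarize` function, and the field names are hypothetical, and it assumes that "resolved" means every behavioral test passes, which the text above does not state explicitly. The 95% threshold for "almost resolved" comes from the description; read literally, it also includes fully resolved tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's real schema."""
    task_id: str
    tests_passed: int
    tests_total: int


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Return resolved and almost-resolved percentages over all tasks."""
    if not results:
        return {"resolved_percent": 0.0, "almost_resolved_percent": 0.0}

    # Assumption: a task is resolved only when every behavioral test passes.
    resolved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    # From the description: at least 95% of behavioral tests pass.
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    almost = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed * 100 >= 95 * r.tests_total
    )

    n = len(results)
    return {
        "resolved_percent": 100.0 * resolved / n,
        "almost_resolved_percent": 100.0 * almost / n,
    }


if __name__ == "__main__":
    demo = [
        TaskResult("task-a", 100, 100),
        TaskResult("task-b", 96, 100),
        TaskResult("task-c", 95, 100),
        TaskResult("task-d", 50, 100),
    ]
    print(summarize(demo))
```

Under these assumptions, a run in which no task passes every test yields a resolved_percent of 0 even if many tasks pass most of their tests, which is why the almost-resolved metric is reported alongside it.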

Rank  Subject                             Resolved  Model Match                                             Provenance  Sampled
1     Claude Opus 4.7 + mini-SWE-agent    0%        Claude Opus 4.7 (anthropic-claude-opus-4.7)             Imported    2026-05-05
2     Claude Opus 4.6 + mini-SWE-agent    0%        Claude Opus 4.6 (anthropic-claude-opus-4.6)             Imported    2026-05-05
3     Claude Sonnet 4.6 + mini-SWE-agent  0%        Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)         Imported    2026-05-05
4     GPT-5.4 + mini-SWE-agent            0%        GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-05
5     Gemini 3.1 Pro + mini-SWE-agent     0%        Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-05
6     Gemini 3 Flash + mini-SWE-agent     0%        Gemini 3 Flash Preview (google-gemini-3-flash-preview)  Imported    2026-05-05
7     Claude Haiku 4.5 + mini-SWE-agent   0%        Claude Haiku 4.5 (anthropic-claude-haiku-4.5)           Imported    2026-05-05
8     GPT-5.4 Mini + mini-SWE-agent       0%        GPT-5.4 Mini (openai-gpt-5.4-mini)                      Imported    2026-05-05
9     GPT-5 Mini + mini-SWE-agent         0%        GPT-5 Mini (openai-gpt-5-mini)                          Imported    2026-05-05