ProgramBench | BenchmarkList

Metadata

Resolved, Almost Resolved, Avg. Cost (lower is better), Avg. Calls (lower is better)

Rank	Subject	Resolved	Model Match	Provenance	Sampled
1	Claude Opus 4.7 + mini-SWE-agent	0%	Claude Opus 4.7 (anthropic-claude-opus-4.7)	Imported	2026-05-05
2	Claude Opus 4.6 + mini-SWE-agent	0%	Claude Opus 4.6 (anthropic-claude-opus-4.6)	Imported	2026-05-05
3	Claude Sonnet 4.6 + mini-SWE-agent	0%	Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)	Imported	2026-05-05
4	GPT-5.4 + mini-SWE-agent	0%	GPT-5.4 (openai-gpt-5.4)	Imported	2026-05-05
5	Gemini 3.1 Pro + mini-SWE-agent	0%	Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)	Imported	2026-05-05
6	Gemini 3 Flash + mini-SWE-agent	0%	Gemini 3 Flash Preview (google-gemini-3-flash-preview)	Imported	2026-05-05
7	Claude Haiku 4.5 + mini-SWE-agent	0%	Claude Haiku 4.5 (anthropic-claude-haiku-4.5)	Imported	2026-05-05
8	GPT-5.4 Mini + mini-SWE-agent	0%	GPT-5.4 Mini (openai-gpt-5.4-mini)	Imported	2026-05-05
9	GPT-5 Mini + mini-SWE-agent	0%	GPT-5 Mini (openai-gpt-5-mini)	Imported	2026-05-05