DeepSWE

Long-horizon software engineering benchmark measuring frontier coding agents on original tasks from active open-source repositories with isolated environments and program-based verifiers.

Rows: 16
Primary metric: pass_at_1
Sampled: 2026-05-26

Metadata

Metrics

Pass@1, Pass@1 95% CI Half Width (lower is better), Pass@4, Passed Attempts, Attempted Rollouts, Tasks Attempted, Tasks Passed by Any Rollout, Median Cost to Pass (lower is better), Median Steps (lower is better), Median Output Tokens to Pass (lower is better), Median Agent Steps to Pass (lower is better)
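
The page lists these metrics without saying how they are computed. As a rough sketch only, the block below assumes Pass@k uses the widely used unbiased estimator over n rollouts per task, and that the 95% CI half width is a normal-approximation interval over attempted rollouts. Both choices, and the function names `pass_at_k` and `ci95_half_width`, are assumptions for illustration and are not confirmed by the source.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: rollouts attempted, c: rollouts that passed, k: sample budget.
    Assumed estimator; the benchmark's own formula is not stated.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a pass.
        return 1.0
    # 1 - P(all k sampled rollouts are failures)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def ci95_half_width(passed: int, attempted: int) -> float:
    """Normal-approximation 95% CI half width for a pass rate (assumed method)."""
    if attempted <= 0 or not 0 <= passed <= attempted:
        raise ValueError("require 0 <= passed <= attempted and attempted > 0")
    p = passed / attempted
    return 1.96 * math.sqrt(p * (1.0 - p) / attempted)


if __name__ == "__main__":
    # One task, 4 rollouts, 1 pass.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=4))
    print(ci95_half_width(passed=50, attempted=100))
```

Under these assumptions, a narrower half width simply reflects more attempted rollouts or a pass rate closer to 0 or 1, which is why the metric is marked "lower is better".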

Latest Results

Rows are ranked by Pass@1. According to the source, context-window failures and agent timeouts count as failures, while provider, verifier, and network errors are excluded from scoring.
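
The scoring rule above can be sketched as a small filter-then-count step. The outcome labels (`"passed"`, `"failed"`, `"context_window"`, `"agent_timeout"`, `"provider_error"`, `"verifier_error"`, `"network_error"`) and the function name `pass_at_1_percent` are hypothetical; the source only describes the categories in prose, and this is an illustration rather than the benchmark's actual scoring code.

```python
# Hypothetical outcome labels; the source names the categories only in prose.
SCORED_FAILURES = {"failed", "context_window", "agent_timeout"}
EXCLUDED = {"provider_error", "verifier_error", "network_error"}


def pass_at_1_percent(outcomes: list[str]) -> float:
    """Pass@1 as a percentage over scored rollouts.

    Context-window failures and agent timeouts stay in the denominator
    as failures; provider, verifier, and network errors are dropped.
    """
    scored = [o for o in outcomes if o not in EXCLUDED]
    unknown = set(scored) - SCORED_FAILURES - {"passed"}
    if unknown:
        raise ValueError(f"unrecognized outcome labels: {sorted(unknown)}")
    if not scored:
        raise ValueError("no scored rollouts after exclusions")
    passed = sum(1 for o in scored if o == "passed")
    return 100.0 * passed / len(scored)


if __name__ == "__main__":
    rollouts = [
        "passed",
        "agent_timeout",   # scored as a failure
        "network_error",   # excluded from the denominator
        "context_window",  # scored as a failure
        "passed",
    ]
    print(pass_at_1_percent(rollouts))  # 2 passes out of 4 scored rollouts
```

Keeping agent-side failures in the denominator while dropping infrastructure errors means a flaky provider does not lower a model's score, but running out of context or time does.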

Rank | Subject | Pass@1 | Model Match | Provenance | Sampled
1 | GPT-5.5 + mini-SWE-agent (xhigh) | 70.05 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-26
2 | GPT-5.4 + mini-SWE-agent (xhigh) | 55.53 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-26
3 | Claude Opus 4.7 + mini-SWE-agent (max) | 54.20 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-26
4 | Claude Sonnet 4.6 + mini-SWE-agent (high) | 31.56 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-26
5 | Gemini 3.5 Flash + mini-SWE-agent (medium) | 28.32 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-26
6 | Claude Opus 4.6 + mini-SWE-agent (max) | 27.06 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-26
7 | GPT-5.4 Mini + mini-SWE-agent (xhigh) | 24.34 | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-26
8 | Kimi K2.6 + mini-SWE-agent | 23.89 | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-26
9 | MiMo v2.5 Pro + mini-SWE-agent | 19.47 | MiMo-V2.5-Pro (xiaomi-mimo-v2.5-pro) | Imported | 2026-05-26
10 | GLM-5.1 + mini-SWE-agent | 17.48 | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-26
11 | Gemini 3.1 Pro Preview + mini-SWE-agent | 9.88 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-26
12 | DeepSeek V4 Pro + mini-SWE-agent | 7.52 | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-26
13 | Gemini 3 Flash Preview + mini-SWE-agent | 5.16 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-26
14 | Qwen3.6 Plus + mini-SWE-agent | 2.65 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-26
15 | Claude Haiku 4.5 + mini-SWE-agent | 0.22 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-26
16 | MiniMax M2.7 + mini-SWE-agent | 0.22 | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-26