DeepSWE
Long-horizon software engineering benchmark measuring frontier coding agents on original tasks from active open-source repositories with isolated environments and program-based verifiers.
16 rows
Primary metric: pass_at_1
Sampled: 2026-05-26
Metadata
Metrics
Pass@1, Pass@1 95% CI Half Width (lower is better), Pass@4, Passed Attempts, Attempted Rollouts, Tasks Attempted, Tasks Passed by Any Rollout, Median Cost to Pass (lower is better), Median Steps (lower is better), Median Output Tokens to Pass (lower is better), Median Agent Steps to Pass (lower is better)
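The page does not say how these metrics are computed. As a rough illustration only, the sketch below assumes Pass@k is the commonly used unbiased per-task estimator averaged over tasks, and that the 95% CI half width comes from a normal approximation over attempted rollouts. The function names and the sample data are hypothetical and are not taken from the benchmark's code. Values here are fractions in [0, 1], whereas the table appears to use a 0-100 scale.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task with n rollouts, c of which passed."""
    if k > n:
        raise ValueError("k cannot exceed the number of rollouts n")
    if n - c < k:
        # Every size-k subset of rollouts contains at least one pass.
        return 1.0
    # 1 minus the probability that k rollouts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n_rollouts, n_passed) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def ci_half_width(p: float, n_rollouts: int, z: float = 1.96) -> float:
    """Normal-approximation half width of a confidence interval (assumed method)."""
    return z * sqrt(p * (1.0 - p) / n_rollouts)


# Hypothetical data: three tasks, four rollouts each, with 4, 1, and 0 passes.
results = [(4, 4), (4, 1), (4, 0)]
p1 = benchmark_pass_at_k(results, k=1)  # (1 + 0.25 + 0) / 3
p4 = benchmark_pass_at_k(results, k=4)  # (1 + 1 + 0) / 3
hw = ci_half_width(p1, n_rollouts=sum(n for n, _ in results))
print(f"pass@1={p1:.4f} pass@4={p4:.4f} half_width={hw:.4f}")
```

With k = 1 the estimator reduces to the fraction of passing rollouts per task, which is why Pass@1 can be read as an average single-attempt success rate.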
| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 + mini-SWE-agent (xhigh) | 70.05 | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-26 |
| 2 | GPT-5.4 + mini-SWE-agent (xhigh) | 55.53 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-26 |
| 3 | Claude Opus 4.7 + mini-SWE-agent (max) | 54.20 | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-26 |
| 4 | Claude Sonnet 4.6 + mini-SWE-agent (high) | 31.56 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-26 |
| 5 | Gemini 3.5 Flash + mini-SWE-agent (medium) | 28.32 | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-26 |
| 6 | Claude Opus 4.6 + mini-SWE-agent (max) | 27.06 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-26 |
| 7 | GPT-5.4 Mini + mini-SWE-agent (xhigh) | 24.34 | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Imported | 2026-05-26 |
| 8 | Kimi K2.6 + mini-SWE-agent | 23.89 | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-26 |
| 9 | MiMo v2.5 Pro + mini-SWE-agent | 19.47 | MiMo-V2.5-Pro (`xiaomi-mimo-v2.5-pro`) | Imported | 2026-05-26 |
| 10 | GLM-5.1 + mini-SWE-agent | 17.48 | GLM 5.1 (`z-ai-glm-5.1`) | Imported | 2026-05-26 |
| 11 | Gemini 3.1 Pro Preview + mini-SWE-agent | 9.88 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-26 |
| 12 | DeepSeek V4 Pro + mini-SWE-agent | 7.52 | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Imported | 2026-05-26 |
| 13 | Gemini 3 Flash Preview + mini-SWE-agent | 5.16 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-26 |
| 14 | Qwen3.6 Plus + mini-SWE-agent | 2.65 | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Imported | 2026-05-26 |
| 15 | Claude Haiku 4.5 + mini-SWE-agent | 0.22 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-26 |
| 16 | MiniMax M2.7 + mini-SWE-agent | 0.22 | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-26 |