SWE-bench Verified
A human-annotated subset of SWE-bench with 500 verified software engineering tasks.
13 rows
Primary metric: resolved
Sampled: 2026-05-28
Showing 4 latest source slices.
| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 88.6% | Claude Opus 4.8 (`anthropic-claude-opus-4.8`) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 | 87.6% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Self-reported | 2026-05-28 |
| 3 | Gemini 3.1 Pro Preview | 80.6% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Self-reported | 2026-05-28 |

| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 Max | 80.8% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Self-reported | 2026-05-28 |
| 2 | DeepSeek V4 Pro Max | 80.6% | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Self-reported | 2026-05-28 |
| 3 | Qwen3.7 Max | 80.4% | Qwen3.7 Max (`qwen-qwen3.7-max`) | Self-reported | 2026-05-28 |
| 4 | Kimi K2.6 Thinking | 80.2% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Self-reported | 2026-05-28 |
| 5 | Qwen3.6 Plus | 78.8% | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Self-reported | 2026-05-28 |

| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | 93.9% | Claude Mythos Preview (`anthropic-claude-mythos-preview`) | Launch post | 2026-04-16 |
| 2 | Claude Opus 4.7 | 87.6% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Launch post | 2026-04-16 |
| 3 | Claude Opus 4.6 | 80.8% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Launch post | 2026-04-16 |
| 4 | Gemini 3.1 Pro Preview | 80.6% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Launch post | 2026-04-16 |

| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o (2024-05-13) | 33.2% | GPT-4o (2024-05-13) (`openai-gpt-4o-2024-05-13`) | Imported | 2024-08-13 |
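
To make the Resolved percentages easier to interpret, the sketch below converts a score into an approximate task count. It assumes the percentage is simply the share of the 500 tasks that were resolved, which this page does not state explicitly. The helper name `approx_resolved_tasks` is hypothetical and used only for illustration.

```python
TOTAL_TASKS = 500  # task count taken from the benchmark description above


def approx_resolved_tasks(resolved_pct: float, total: int = TOTAL_TASKS) -> float:
    """Convert a resolved percentage into an approximate number of tasks.

    Assumption: the score is (resolved tasks / total tasks) * 100.
    """
    return round(resolved_pct / 100 * total, 1)


# A few scores copied from the table above.
scores = {
    "Claude Opus 4.8": 88.6,
    "Claude Mythos Preview": 93.9,
    "GPT-4o (2024-05-13)": 33.2,
}

for subject, pct in scores.items():
    print(f"{subject}: ~{approx_resolved_tasks(pct)} of {TOTAL_TASKS} tasks")
```

Under this assumption, 88.6% corresponds to 443 tasks and 33.2% to 166 tasks. A score such as 93.9% maps to a non-integer count (469.5), which suggests that some reported figures may be averaged over several runs or computed on a different task count; the table does not say which.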