SWE-bench Multilingual
SWE-bench extension with 300 software issue tasks spanning 9 programming languages.
Rows: 21
Primary metric: resolved
Sampled: 2026-05-28
Showing the 3 latest source slices. Ranks restart at 1 within each slice.
| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 84.4% | Claude Opus 4.8 (`anthropic-claude-opus-4.8`) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 | 80.5% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Self-reported | 2026-05-28 |
| 1 | Qwen3.7 Max | 78.3% | Qwen3.7 Max (`qwen-qwen3.7-max`) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.6 Max | 77.5% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Self-reported | 2026-05-28 |
| 3 | Kimi K2.6 Thinking | 76.7% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Self-reported | 2026-05-28 |
| 4 | DeepSeek V4 Pro Max | 76.2% | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Self-reported | 2026-05-28 |
| 5 | Qwen3.6 Plus | 73.8% | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Self-reported | 2026-05-28 |
| 1 | Gemini 3 Flash | 72.7% | — | Imported | 2026-02-20 |
| 2 | Claude 4.6 Opus | 72% | — | Imported | 2026-02-20 |
| 3 | Claude 4.5 Opus | 70.7% | — | Imported | 2026-02-20 |
| 4 | GLM-5 | 69.7% | — | Imported | 2026-02-20 |
| 5 | Gemini 3 Pro | 68.7% | — | Imported | 2026-02-20 |
| 6 | Minimax 2.5 | 68.3% | — | Imported | 2026-02-20 |
| 7 | Kimi K2.5 | 67.3% | — | Imported | 2026-02-20 |
| 8 | Claude 4.5 Sonnet | 67% | — | Imported | 2026-02-20 |
| 9 | GPT-5.2 (high reasoning) | 66.7% | — | Imported | 2026-02-20 |
| 10 | GPT-5-2 Codex | 66.3% | — | Imported | 2026-02-20 |
| 11 | Claude 4.5 Haiku | 64.7% | — | Imported | 2026-02-20 |
| 12 | DeepSeek V3.2 | 59% | — | Imported | 2026-02-20 |
| 13 | GPT-5 mini | 39.7% | — | Imported | 2026-02-20 |
| 14 | GPT 5.2 Codex | 66.3% | — | Imported | 2026-02-20 |
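
The Resolved column can be related back to the 300-task count in the description with a little arithmetic. The sketch below assumes a score is the share of all 300 tasks resolved in a single run, rounded to one decimal place. The page does not state that assumption, and the helper names `resolved_count` and `matches_whole_count` are hypothetical.

```python
TOTAL_TASKS = 300  # task count taken from the benchmark description


def resolved_count(percent, total=TOTAL_TASKS):
    """Nearest whole number of resolved tasks for a reported percentage."""
    return round(percent * total / 100)


def matches_whole_count(percent, total=TOTAL_TASKS):
    """True if the percentage is what a whole task count out of `total`
    would give when rounded to one decimal place (assumed rounding)."""
    count = resolved_count(percent, total)
    return abs(round(100 * count / total, 1) - percent) < 1e-9


if __name__ == "__main__":
    for score in (72.7, 84.4):
        print(score, resolved_count(score), matches_whole_count(score))
    # prints:
    # 72.7 218 True
    # 84.4 253 False
```

Under that assumption, 72.7% corresponds to 218 of 300 tasks, while 84.4% does not map to a whole count out of 300. A mismatch like this may mean the score was averaged over several runs or computed on a different task set, so it is a prompt to check the source rather than evidence of an error.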