Multi-SWE-Bench
A multilingual issue-resolving benchmark that evaluates the ability of large language models to resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances carefully annotated by 68 expert annotators. It addresses a limitation of existing benchmarks, which focus almost exclusively on Python.
6 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | MiniMax M2.7 | 0.53 | MiniMax M2.7 (`minimax-minimax-m2.7`) | Self-reported | 2026-05-06 |
| 2 | MiniMax M2.5 | 0.51 | MiniMax M2.5 (`minimax-minimax-m2.5`) | Self-reported | 2026-05-06 |
| 3 | MiniMax M2.1 | 0.49 | MiniMax M2.1 (`minimax-minimax-m2.1`) | Self-reported | 2026-05-06 |
| 4 | Kimi K2-Thinking-0905 | 0.42 | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Self-reported | 2026-05-06 |
| 5 | MiniMax M2 | 0.36 | MiniMax M2 (`minimax-minimax-m2`) | Self-reported | 2026-05-06 |
| 6 | Qwen3-Coder 480B A35B Instruct | 0.26 | Qwen3 Coder 480B A35B (`qwen-qwen3-coder`) | Self-reported | 2026-05-06 |