Multi-SWE-Bench

A multilingual benchmark for issue resolving that evaluates the ability of Large Language Models to resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances annotated by 68 expert annotators. It addresses a limitation of existing benchmarks, which focus almost exclusively on Python.

Rows: 6
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                         Score  Model Match                                                      Provenance     Sampled
1     MiniMax M2.7                    0.53   MiniMax M2.7 (minimax-minimax-m2.7)                              Self-reported  2026-05-06
2     MiniMax M2.5                    0.51   MiniMax M2.5 (minimax-minimax-m2.5)                              Self-reported  2026-05-06
3     MiniMax M2.1                    0.49   MiniMax M2.1 (minimax-minimax-m2.1)                              Self-reported  2026-05-06
4     Kimi K2-Thinking-0905           0.42   KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking)  Self-reported  2026-05-06
5     MiniMax M2                      0.36   MiniMax M2 (minimax-minimax-m2)                                  Self-reported  2026-05-06
6     Qwen3-Coder 480B A35B Instruct  0.26   Qwen3 Coder 480B A35B (qwen-qwen3-coder)                         Self-reported  2026-05-06