SWE-bench Verified

Solving production software engineering tasks

Rows: 50
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
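The standard error listed among the metrics can be approximated from a score alone. The sketch below uses the textbook binomial formula, sqrt(p(1-p)/n). Two things here are assumptions, not taken from this page: the page does not say how Vals.ai derives its reported standard error, and the task count of 500 is assumed as the size of the SWE-bench Verified set. The function name `binomial_std_error` is hypothetical.

```python
import math


def binomial_std_error(accuracy_pct: float, n_tasks: int) -> float:
    """Approximate standard error of an accuracy score, in percentage points.

    Treats each task as an independent pass/fail trial (binomial model).
    This is an illustrative approximation, not necessarily the method
    used to produce the leaderboard's own "Std. error" column.
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


# Assumption: 500 tasks in the evaluated set.
N_TASKS = 500

# Scores of the top two rows on this page.
for subject, score in [("Claude Opus 4.8", 88.6), ("GPT 5.5", 82.6)]:
    se = binomial_std_error(score, N_TASKS)
    print(f"{subject}: {score}% +/- {se:.2f} points")
```

Under these assumptions the standard error near the top of the board is roughly one and a half percentage points, so gaps of a few tenths of a point between adjacent rows should be read with caution.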

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.8 | 88.6% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
2 | GPT 5.5 | 82.6% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
3 | Claude Opus 4.7 | 82% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
4 | Gemini 3.1 Pro Preview | 78.8% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
5 | Gemini 3.5 Flash | 78.8% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
6 | Claude Opus 4.6 Thinking | 78.2% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28
7 | GPT 5.4 2026-03-05 | 78.2% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
8 | GPT 5.3 Codex | 78% | GPT-5.3-Codex (openai-gpt-5.3-codex) | Imported | 2026-05-28
9 | Claude Sonnet 4.6 | 77.4% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
10 | DeepSeek V4 Pro | 77.4% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28
11 | Claude Opus 4.5 20251101 Thinking | 76.4% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
12 | Gemini 3 Pro Preview | 76.4% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28
13 | GLM 5.1 Thinking | 76.4% | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28
14 | Kimi K2.6 Thinking | 76.2% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
15 | GPT 5.2 2025-12-11 | 75.8% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28
16 | Gemini 3 Flash Preview | 75% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
17 | MiniMax M2.1 | 74.8% | MiniMax M2.1 (minimax-minimax-m2.1) | Imported | 2026-05-28
18 | Muse Spark | 74.4% | (none) | Imported | 2026-05-28
19 | MiniMax M2.5 Lightning | 74.2% | (none) | Imported | 2026-05-28
20 | MiniMax M2.7 | 73.8% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28
21 | Qwen 3.6 Plus | 73.4% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28
22 | GPT 5.4 Mini 2026-03-17 | 73% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
23 | Qwen 3.6 Max Preview | 72.8% | Qwen3.6 Max Preview (qwen-qwen3.6-max-preview) | Imported | 2026-05-28
24 | GPT 5.2 Codex | 72.4% | GPT-5.2-Codex (openai-gpt-5.2-codex) | Imported | 2026-05-28
25 | Grok 4.20 0309 Reasoning | 72.2% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
26 | Grok 4.3 | 71.4% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
27 | GLM 5 Thinking | 71.4% | GLM 5 (z-ai-glm-5) | Imported | 2026-05-28
28 | Qwen 3.5 Plus Thinking | 71.2% | (none) | Imported | 2026-05-28
29 | Qwen 3.6 27B | 70% | Qwen3.6 27B (qwen-qwen3.6-27b) | Imported | 2026-05-28
30 | Claude Sonnet 4.5 20250929 Thinking | 70% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
31 | Kimi K2.5 Thinking | 70% | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28
32 | GPT 5.1 2025-11-13 | 69.8% | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28
33 | GPT 5.4 Nano 2026-03-17 | 69.8% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
34 | GLM 4.7 | 69.4% | GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-28
35 | GPT 5 2025-08-07 | 69% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28
36 | Qwen 3.7 Max | 68.8% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28
37 | DeepSeek V3P2 Thinking | 67.6% | (none) | Imported | 2026-05-28
38 | Claude Haiku 4.5 20251001 Thinking | 66.6% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28
39 | Qwen 3.5 Flash | 64.4% | Qwen3.5-Flash (qwen-qwen3.5-flash-02-23) | Imported | 2026-05-28
40 | Gemini 3.1 Flash Lite Preview | 62.8% | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-28
41 | Devstral 2512 | 62.8% | Mistral: Devstral 2 2512 (mistralai-devstral-2512) | Imported | 2026-05-28
42 | GPT 5 Mini 2025-08-07 | 60.8% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28
43 | Kimi K2 Thinking | 60.2% | MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-28
44 | Grok 4 0709 | 57.8% | Grok 4 (x-ai-grok-4) | Imported | 2026-05-28
45 | Gemini 2.5 Pro | 54.4% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28
46 | Grok 4 Fast Reasoning | 45.4% | Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
47 | Grok 4.1 Fast Reasoning | 41.4% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
48 | Mistral Large 2512 | 41.4% | Mistral: Mistral Large 3 2512 (mistralai-mistral-large-2512) | Imported | 2026-05-28
49 | GPT Oss 120B | 33.6% | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28
50 | Command A 03 2025 | 7.8% | Command A (cohere-command-a) | Imported | 2026-05-28
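The leaderboard numbers rows sequentially even when scores are identical: ranks 4 and 5 both show 78.8%, for example. For comparisons where ties should share a rank, the following is a minimal sketch of standard competition ranking ("1, 2, 2, 4"). The function name `competition_ranks` and the tuple layout are hypothetical choices for illustration, and the sample rows are copied from the table above.

```python
from typing import List, Tuple


def competition_ranks(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign tie-aware ranks to (subject, score) pairs.

    Rows are sorted by score, highest first. Rows with equal scores
    share the rank of the first row in the tie, and the next distinct
    score resumes at its sequential position.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: share its rank
        else:
            rank = position
        ranked.append((rank, subject, score))
    return ranked


# Four consecutive rows from the leaderboard (ranks 3 to 6 as listed).
sample = [
    ("Claude Opus 4.7", 82.0),
    ("Gemini 3.1 Pro Preview", 78.8),
    ("Gemini 3.5 Flash", 78.8),
    ("Claude Opus 4.6 Thinking", 78.2),
]

ranked = competition_ranks(sample)
for rank, subject, score in ranked:
    print(rank, subject, f"{score}%")
```

Within this four-row sample the two Gemini entries share rank 2 and the following row takes rank 4. Comparing floats for equality is acceptable here only because the scores are copied as literals; scores computed from raw counts would be better compared as integer pass counts.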