SWE-bench Verified (Bash Only)

Bash-only variant of SWE-bench Verified for real-world GitHub issue resolution.

Rows: 12
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

Rank  Subject                    Score  Model Match  Provenance  Sampled
1     Claude Opus 4.5            64.80               Imported    2026-05-06
2     GPT-5.2                    61.60               Imported    2026-05-06
3     Claude 3.7 Sonnet          52.20               Imported    2026-05-06
4     DeepSeek V3                52.10               Imported    2026-05-06
5     o3                         43.69               Imported    2026-05-06
6     GPT-4.1                    41                  Imported    2026-05-06
7     Grok-3 mini                38.60               Imported    2026-05-06
8     o4-mini-2025-04-16 medium  34.60               Imported    2026-05-06
9     GPT-4.1 mini               32.80               Imported    2026-05-06
10    Qwen Plus                  28                  Imported    2026-05-06
11    GPT-4o                     25.40               Imported    2026-05-06
12    Gemini 2.5 Pro (Jun 2025)  22                  Imported    2026-05-06
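
The rows above can be read back into structured records with a small parser. The following is a minimal sketch, not the leaderboard's own import code: the names `LeaderboardRow` and `parse_row` are hypothetical, and it assumes the Model Match column is empty (as in every row listed here), so that the last three whitespace-separated tokens of a row are always the score, the provenance, and the sampled date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LeaderboardRow:
    """One leaderboard entry (hypothetical record type for illustration)."""
    rank: int
    subject: str
    score: float
    provenance: str
    sampled: date


def parse_row(line: str) -> LeaderboardRow:
    """Parse one whitespace-separated leaderboard row.

    Assumption: the Model Match column is empty, so the last three
    tokens are score, provenance, and sampled date, and everything
    between the rank and the score is the subject name.
    """
    tokens = line.split()
    if len(tokens) < 5:
        raise ValueError(f"unexpected row: {line!r}")
    return LeaderboardRow(
        rank=int(tokens[0]),
        subject=" ".join(tokens[1:-3]),
        score=float(tokens[-3]),
        provenance=tokens[-2],
        sampled=date.fromisoformat(tokens[-1]),
    )


# Three rows copied from the table, including subjects with spaces.
rows = [
    parse_row("1 Claude Opus 4.5 64.80 Imported 2026-05-06"),
    parse_row("8 o4-mini-2025-04-16 medium 34.60 Imported 2026-05-06"),
    parse_row("12 Gemini 2.5 Pro (Jun 2025) 22 Imported 2026-05-06"),
]
```

Splitting from both ends keeps multi-word subjects such as "Gemini 2.5 Pro (Jun 2025)" intact without needing a fixed column width. If a row ever carried a Model Match value, this sketch would fold it into the wrong field, so a stricter parser would be needed in that case.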