SWE Atlas - Test Writing

SWE Atlas Test Writing evaluates coding agents on writing production-grade tests for specific behaviors in real-world software repositories.

12 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Confidence Interval Upper, Max Score

Latest Results

Rank  Subject                      Score  Model Match                                             Provenance  Sampled
1     Gpt-5.4-xHigh (Codex CLI)    44.36  GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-06
1     Gpt-5.4-xhigh (Mini-SWE)     40     GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-06
1     Gpt-5.3-Codex-Xhigh (Codex)  38.98  GPT-5.3-Codex (openai-gpt-5.3-codex)                    Imported    2026-05-06
1     Opus-4.6 (Claude Code)       36.67  -                                                       Imported    2026-05-06
1     Opus-4.6 (Mini-SWE)          36.08  -                                                       Imported    2026-05-06
2     Sonnet-4.6 (Claude Code)     31.76  -                                                       Imported    2026-05-06
2     Muse Spark                   31.11  -                                                       Imported    2026-05-06
2     Gemini-3-Flash (Mini-SWE)    30.30  Gemini 3 Flash Preview (google-gemini-3-flash-preview)  Imported    2026-05-06
2     Gemini-3.1-Pro (Mini-SWE)    29.84  Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-06
2     Glm-5 (Mini-SWE)             28.74  GLM GLM 5 (z-ai-glm-5)                                  Imported    2026-05-06
5     Kimi-K2.5 (Mini-SWE)         25.77  KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5)       Imported    2026-05-06
10    Minimax-M2.5 (Mini-SWE)      18.60  MiniMax M2.5 (minimax-minimax-m2.5)                     Imported    2026-05-06