SWE Atlas - Test Writing
SWE Atlas Test Writing evaluates coding agents on writing production-grade tests for specific behaviors in real-world software repositories.
12 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Confidence Interval Upper, Max Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gpt-5.4-xHigh (Codex CLI) | 44.36 | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-06 |
| 1 | Gpt-5.4-xhigh (Mini-SWE) | 40 | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-06 |
| 1 | Gpt-5.3-Codex-Xhigh (Codex) | 38.98 | GPT-5.3-Codex openai-gpt-5.3-codex | Imported | 2026-05-06 |
| 1 | Opus-4.6 (Claude Code) | 36.67 | — | Imported | 2026-05-06 |
| 1 | Opus-4.6 (Mini-SWE) | 36.08 | — | Imported | 2026-05-06 |
| 2 | Sonnet-4.6 (Claude Code) | 31.76 | — | Imported | 2026-05-06 |
| 2 | Muse Spark | 31.11 | — | Imported | 2026-05-06 |
| 2 | Gemini-3-Flash (Mini-SWE) | 30.30 | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-06 |
| 2 | Gemini-3.1-Pro (Mini-SWE) | 29.84 | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-06 |
| 2 | Glm-5 (Mini-SWE) | 28.74 | GLM 5 z-ai-glm-5 | Imported | 2026-05-06 |
| 5 | Kimi-K2.5 (Mini-SWE) | 25.77 | MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5 | Imported | 2026-05-06 |
| 10 | Minimax-M2.5 (Mini-SWE) | 18.60 | MiniMax M2.5 minimax-minimax-m2.5 | Imported | 2026-05-06 |