Terminal-Bench 2.1
State-of-the-art set of difficult terminal-based tasks
Rows: 20
Primary metric: score
Sampled: 2026-05-28
Metrics: Score (higher is better), Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
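The page does not say how the standard error is computed. As a rough sketch only, assuming the score is a pass rate over independent tasks, the binomial standard error could be estimated as below. The function name and the task counts are hypothetical and are not taken from the leaderboard.

```python
import math


def binomial_standard_error(passed: int, total: int) -> float:
    """Standard error of a pass rate, treating each task as an
    independent pass/fail trial (an assumption, not the page's
    documented method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    return math.sqrt(p * (1.0 - p) / total)


# Hypothetical counts for illustration only.
passed, total = 50, 100
score = passed / total
std_error = binomial_standard_error(passed, total)
print(f"score={score:.1%} std_error={std_error:.1%}")
```

Under this assumption, two scores that differ by less than roughly two standard errors are hard to tell apart, which is why the standard error is listed alongside the score.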
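Showing the 2 latest source slices: imported results (16 rows) and self-reported results (4 rows). Ranks restart at 1 within each slice.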
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.5 | 76.404% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 2 | Gemini 3.5 Flash | 74.157% | Gemini 3.5 Flash google-gemini-3.5-flash | Imported | 2026-05-28 |
| 3 | Claude Opus 4.8 | 71.910% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 70.787% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 5 | Claude Opus 4.7 | 68.539% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-28 |
| 6 | Qwen 3.7 Max | 61.049% | Qwen3.7 Max qwen-qwen3.7-max | Imported | 2026-05-28 |
| 7 | GLM 5.1 Thinking | 56.929% | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-28 |
| 8 | Gemini 3 Flash Preview | 53.933% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-28 |
| 9 | Kimi K2.6 Thinking | 53.558% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-28 |
| 10 | Qwen 3.6 Plus | 53.184% | Qwen3.6 Plus qwen-qwen3.6-plus | Imported | 2026-05-28 |
| 11 | DeepSeek V4 Pro | 50.187% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Imported | 2026-05-28 |
| 12 | MiniMax M2.7 | 48.689% | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-28 |
| 13 | Grok 4.20 0309 Reasoning | 44.195% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-28 |
| 14 | Claude Haiku 4.5 20251001 Thinking | 43.820% | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-28 |
| 15 | Mistral Medium 3.5 | 38.951% | Mistral: Mistral Medium 3.5 mistralai-mistral-medium-3-5 | Imported | 2026-05-28 |
| 16 | Command A Plus 05 2026 | 17.603% | — | Imported | 2026-05-28 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 78.2% | GPT-5.5 openai-gpt-5.5 | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 74.6% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 3 | Gemini 3.1 Pro Preview | 70.3% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 66.1% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |