Terminal-Bench 2.0
State-of-the-art set of difficult terminal-based tasks
80 rows
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics
Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
Showing 4 latest source slices.
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.5 | 73.202% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 70.037% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 68.539% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 67.416% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 5 | Gemini 3.5 Flash | 67.416% | Gemini 3.5 Flash google-gemini-3.5-flash | Imported | 2026-05-28 |
| 6 | GPT 5.3 Codex | 64.045% | GPT-5.3-Codex openai-gpt-5.3-codex | Imported | 2026-05-28 |
| 7 | Claude Sonnet 4.6 | 59.551% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-28 |
| 8 | Muse Spark | 59.551% | — | Imported | 2026-05-28 |
| 9 | Qwen 3.7 Max | 59.176% | Qwen3.7 Max qwen-qwen3.7-max | Imported | 2026-05-28 |
| 10 | Claude Opus 4.5 20251101 | 58.427% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-28 |
| 11 | Claude Opus 4.6 Thinking | 58.427% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-28 |
| 12 | GPT 5.4 2026-03-05 | 58.427% | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-28 |
| 13 | Kimi K2.6 Thinking | 57.303% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-28 |
| 14 | DeepSeek V4 Pro | 56.18% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Imported | 2026-05-28 |
| 15 | Gemini 3 Pro Preview | 55.056% | Gemini 3 google-gemini-3 | Imported | 2026-05-28 |
| 16 | Claude Opus 4.5 20251101 Thinking | 53.933% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-28 |
| 17 | GLM 5.1 Thinking | 53.933% | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-28 |
| 18 | Qwen 3.6 Max Preview | 51.685% | Qwen3.6 Max Preview qwen-qwen3.6-max-preview | Imported | 2026-05-28 |
| 19 | Gemini 3 Flash Preview | 51.685% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-28 |
| 20 | GPT 5.2 2025-12-11 | 51.685% | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-28 |
| 21 | GLM 5 Thinking | 49.438% | GLM 5 z-ai-glm-5 | Imported | 2026-05-28 |
| 22 | MiniMax M2.7 | 47.191% | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-28 |
| 23 | Qwen 3.6 27B | 44.944% | Qwen3.6 27B qwen-qwen3.6-27b | Imported | 2026-05-28 |
| 24 | Qwen 3.6 Plus | 44.944% | Qwen3.6 Plus qwen-qwen3.6-plus | Imported | 2026-05-28 |
| 25 | GPT 5.1 2025-11-13 | 44.944% | GPT-5.1 openai-gpt-5.1 | Imported | 2026-05-28 |
| 26 | GPT 5.4 Mini 2026-03-17 | 44.944% | GPT-5.4 Mini openai-gpt-5.4-mini | Imported | 2026-05-28 |
| 27 | Grok 4.3 | 43.446% | Grok 4.3 x-ai-grok-4.3 | Imported | 2026-05-28 |
| 28 | Qwen 3.5 Plus Thinking | 41.573% | — | Imported | 2026-05-28 |
| 29 | Claude Sonnet 4.5 20250929 Thinking | 41.573% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-28 |
| 30 | MiniMax M2.5 Lightning | 41.573% | — | Imported | 2026-05-28 |
| 31 | Grok 4.20 0309 Reasoning | 40.449% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-28 |
| 32 | Kimi K2.5 Thinking | 40.449% | MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5 | Imported | 2026-05-28 |
| 33 | GPT 5.4 Nano 2026-03-17 | 39.888% | GPT-5.4 Nano openai-gpt-5.4-nano | Imported | 2026-05-28 |
| 34 | Gemma 4 31B It | 39.326% | Gemma 4 31B google-gemma-4-31b-it | Imported | 2026-05-28 |
| 35 | Claude Haiku 4.5 20251001 Thinking | 38.202% | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-28 |
| 36 | GLM 4.7 | 38.202% | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-28 |
| 37 | Kimi K2 Thinking | 37.079% | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2026-05-28 |
| 38 | MiniMax M2.1 | 37.079% | MiniMax M2.1 minimax-minimax-m2.1 | Imported | 2026-05-28 |
| 39 | GPT 5 2025-08-07 | 37.079% | GPT-5 openai-gpt-5 | Imported | 2026-05-28 |
| 40 | DeepSeek V3P2 Thinking | 35.955% | — | Imported | 2026-05-28 |
| 41 | DeepSeek V3P2 | 34.831% | — | Imported | 2026-05-28 |
| 42 | Gemini 2.5 Pro | 30.337% | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-28 |
| 43 | Mistral Medium 3.5 | 30.337% | Mistral: Mistral Medium 3.5 mistralai-mistral-medium-3-5 | Imported | 2026-05-28 |
| 44 | Grok 4 Fast Reasoning | 29.213% | Grok 4 Fast x-ai-grok-4-fast | Imported | 2026-05-28 |
| 45 | Grok 4 0709 | 28.09% | Grok 4 x-ai-grok-4 | Imported | 2026-05-28 |
| 46 | GLM 4.6 | 28.09% | GLM 4.6 z-ai-glm-4.6 | Imported | 2026-05-28 |
| 47 | GPT 5 Mini 2025-08-07 | 26.966% | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-28 |
| 48 | Kimi K2 Instruct | 25.843% | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-05-28 |
| 49 | Qwen 3 Max | 24.719% | Qwen3 Max qwen-qwen3-max | Imported | 2026-05-28 |
| 50 | Qwen 3.5 Flash | 24.719% | Qwen3.5-Flash qwen-qwen3.5-flash-02-23 | Imported | 2026-05-28 |
| 51 | Gemini 3.1 Flash Lite Preview | 24.719% | Gemini 3.1 Flash Lite Preview google-gemini-3.1-flash-lite-preview | Imported | 2026-05-28 |
| 52 | Grok 4.1 Fast Reasoning | 24.719% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-28 |
| 53 | DeepSeek V3P1 | 22.472% | — | Imported | 2026-05-28 |
| 54 | Gemini 2.5 Flash Preview 09 2025 Thinking | 21.348% | — | Imported | 2026-05-28 |
| 55 | Qwen 3 Max 2026-01-23 | 20.225% | — | Imported | 2026-05-28 |
| 56 | GPT Oss 120B | 19.101% | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-28 |
| 57 | Trinity Large Thinking | 17.978% | Trinity Large Thinking arcee-ai-trinity-large-thinking | Imported | 2026-05-28 |
| 58 | Grok 4.1 Fast Non Reasoning | 17.978% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-28 |
| 59 | Command A Plus 05 2026 | 16.854% | — | Imported | 2026-05-28 |
| 60 | Mistral Small 2603 | 16.854% | Mistral: Mistral Small 4 mistralai-mistral-small-2603 | Imported | 2026-05-28 |
| 61 | GPT 4.1 2025-04-14 | 14.607% | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-28 |
| 62 | Magistral Medium 2509 | 13.483% | — | Imported | 2026-05-28 |
| 63 | Mistral Large 2512 | 8.989% | Mistral: Mistral Large 3 2512 mistralai-mistral-large-2512 | Imported | 2026-05-28 |
| 64 | Command A 03 2025 | 2.247% | Command A cohere-command-a | Imported | 2026-05-28 |
| 65 | Llama4 Maverick Instruct Basic | 2.247% | — | Imported | 2026-05-28 |
| 1 | Qwen3.7 Max | 69.7% | Qwen3.7 Max qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 2 | DeepSeek V4 Pro Max | 67.9% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Self-reported | 2026-05-28 |
| 3 | Kimi K2.6 Thinking | 66.7% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.6 Max | 65.4% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Self-reported | 2026-05-28 |
| 5 | GLM-5.1 Thinking | 63.5% | GLM 5.1 z-ai-glm-5.1 | Self-reported | 2026-05-28 |
| 6 | Qwen3.6 Plus | 61.6% | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-28 |
| 1 | GPT-5.5 | 82.7% | GPT-5.5 openai-gpt-5.5 | Launch post | 2026-04-23 |
| 2 | GPT-5.4 | 75.1% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-23 |
| 3 | Claude Opus 4.7 | 69.4% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Launch post | 2026-04-23 |
| 4 | Gemini 3.1 Pro Preview | 68.5% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Launch post | 2026-04-23 |
| 1 | Claude Mythos Preview | 82% | Claude Mythos Preview anthropic-claude-mythos-preview | Launch post | 2026-04-16 |
| 2 | GPT-5.4 | 75.1% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-16 |
| 3 | Claude Opus 4.7 | 69.4% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Launch post | 2026-04-16 |
| 4 | Gemini 3.1 Pro Preview | 68.5% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Launch post | 2026-04-16 |
| 5 | Claude Opus 4.6 | 65.4% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Launch post | 2026-04-16 |
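
The Rank column restarts at 1 for each source slice, and the same model can appear in several slices with different scores (GPT-5.5 is listed at 73.202% as Imported and at 82.7% in a launch post). Scores are therefore easiest to compare within a slice. The following is a minimal sketch, not part of the leaderboard itself, of how rows in this table's pipe-delimited layout could be grouped by slice. It assumes a slice is identified by the pair (Provenance, Sampled). The names `parse_row` and `group_by_slice` are hypothetical, and the sample rows are copied from the table above.

```python
from collections import defaultdict

# A few rows copied verbatim from the table above; the remaining rows
# use the same six-column layout.
ROWS = """\
| 1 | GPT 5.5 | 73.202% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 1 | Qwen3.7 Max | 69.7% | Qwen3.7 Max qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 1 | GPT-5.5 | 82.7% | GPT-5.5 openai-gpt-5.5 | Launch post | 2026-04-23 |
| 2 | GPT-5.4 | 75.1% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-23 |
| 1 | Claude Mythos Preview | 82% | Claude Mythos Preview anthropic-claude-mythos-preview | Launch post | 2026-04-16 |
"""


def parse_row(line):
    """Parse one table row into a dict; return None for non-data lines."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Header and separator lines do not start with a numeric rank.
    if len(cells) != 6 or not cells[0].isdigit():
        return None
    rank, subject, score, match, provenance, sampled = cells
    return {
        "rank": int(rank),
        "subject": subject,
        "score": float(score.rstrip("%")),
        "model_match": match,
        "provenance": provenance,
        "sampled": sampled,
    }


def group_by_slice(text):
    """Group rows by (provenance, sampled).

    Assumption: this pair is what distinguishes one source slice from another.
    """
    slices = defaultdict(list)
    for line in text.splitlines():
        row = parse_row(line)
        if row is not None:
            slices[(row["provenance"], row["sampled"])].append(row)
    return dict(slices)


slices = group_by_slice(ROWS)
for (provenance, sampled), rows in slices.items():
    top = max(rows, key=lambda r: r["score"])
    print(f"{provenance} ({sampled}): top = {top['subject']} at {top['score']}%")
```

Keying on both columns keeps the two launch-post slices (2026-04-23 and 2026-04-16) separate even though they share a provenance label.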