Terminal-Bench 2.0

State-of-the-art set of difficult terminal-based tasks

Rows: 80
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
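
The page does not say how Score and Std. error are computed. As a rough illustration only, the sketch below assumes the score is the share of tasks passed and that the standard error follows a simple binomial model. The function names and the 50-of-100 example are hypothetical and are not taken from the leaderboard.

```python
import math


def accuracy_pct(passed: int, total: int) -> float:
    """Accuracy as a percentage of tasks passed (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def binomial_std_error_pct(passed: int, total: int) -> float:
    """Standard error in percentage points, assuming independent pass/fail tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


# Hypothetical run: 50 of 100 tasks passed.
score = accuracy_pct(50, 100)
stderr = binomial_std_error_pct(50, 100)
print(f"{score:.3f}% +/- {stderr:.3f}")
```

Under these assumptions, a smaller task set gives a larger standard error, which is why close scores on the leaderboard may not be meaningfully different.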

Showing 4 latest source slices.

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | GPT 5.5 | 73.202% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
2 | Claude Opus 4.8 | 70.037% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
3 | Claude Opus 4.7 | 68.539% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
4 | Gemini 3.1 Pro Preview | 67.416% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
5 | Gemini 3.5 Flash | 67.416% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
6 | GPT 5.3 Codex | 64.045% | GPT-5.3-Codex (openai-gpt-5.3-codex) | Imported | 2026-05-28
7 | Claude Sonnet 4.6 | 59.551% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
8 | Muse Spark | 59.551% | - | Imported | 2026-05-28
9 | Qwen 3.7 Max | 59.176% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28
10 | Claude Opus 4.5 20251101 | 58.427% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
11 | Claude Opus 4.6 Thinking | 58.427% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28
12 | GPT 5.4 2026-03-05 | 58.427% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
13 | Kimi K2.6 Thinking | 57.303% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
14 | DeepSeek V4 Pro | 56.18% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28
15 | Gemini 3 Pro Preview | 55.056% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28
16 | Claude Opus 4.5 20251101 Thinking | 53.933% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
17 | GLM 5.1 Thinking | 53.933% | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28
18 | Qwen 3.6 Max Preview | 51.685% | Qwen3.6 Max Preview (qwen-qwen3.6-max-preview) | Imported | 2026-05-28
19 | Gemini 3 Flash Preview | 51.685% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
20 | GPT 5.2 2025-12-11 | 51.685% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28
21 | GLM 5 Thinking | 49.438% | GLM 5 (z-ai-glm-5) | Imported | 2026-05-28
22 | MiniMax M2.7 | 47.191% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28
23 | Qwen 3.6 27B | 44.944% | Qwen3.6 27B (qwen-qwen3.6-27b) | Imported | 2026-05-28
24 | Qwen 3.6 Plus | 44.944% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28
25 | GPT 5.1 2025-11-13 | 44.944% | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28
26 | GPT 5.4 Mini 2026-03-17 | 44.944% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
27 | Grok 4.3 | 43.446% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
28 | Qwen 3.5 Plus Thinking | 41.573% | - | Imported | 2026-05-28
29 | Claude Sonnet 4.5 20250929 Thinking | 41.573% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
30 | MiniMax M2.5 Lightning | 41.573% | - | Imported | 2026-05-28
31 | Grok 4.20 0309 Reasoning | 40.449% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
32 | Kimi K2.5 Thinking | 40.449% | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28
33 | GPT 5.4 Nano 2026-03-17 | 39.888% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
34 | Gemma 4 31B It | 39.326% | Gemma 4 31B (google-gemma-4-31b-it) | Imported | 2026-05-28
35 | Claude Haiku 4.5 20251001 Thinking | 38.202% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28
36 | GLM 4.7 | 38.202% | GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-28
37 | Kimi K2 Thinking | 37.079% | MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-28
38 | MiniMax M2.1 | 37.079% | MiniMax M2.1 (minimax-minimax-m2.1) | Imported | 2026-05-28
39 | GPT 5 2025-08-07 | 37.079% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28
40 | DeepSeek V3P2 Thinking | 35.955% | - | Imported | 2026-05-28
41 | DeepSeek V3P2 | 34.831% | - | Imported | 2026-05-28
42 | Gemini 2.5 Pro | 30.337% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28
43 | Mistral Medium 3.5 | 30.337% | Mistral: Mistral Medium 3.5 (mistralai-mistral-medium-3-5) | Imported | 2026-05-28
44 | Grok 4 Fast Reasoning | 29.213% | Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
45 | Grok 4 0709 | 28.09% | Grok 4 (x-ai-grok-4) | Imported | 2026-05-28
46 | GLM 4.6 | 28.09% | GLM 4.6 (z-ai-glm-4.6) | Imported | 2026-05-28
47 | GPT 5 Mini 2025-08-07 | 26.966% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28
48 | Kimi K2 Instruct | 25.843% | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-28
49 | Qwen 3 Max | 24.719% | Qwen3 Max (qwen-qwen3-max) | Imported | 2026-05-28
50 | Qwen 3.5 Flash | 24.719% | Qwen3.5-Flash (qwen-qwen3.5-flash-02-23) | Imported | 2026-05-28
51 | Gemini 3.1 Flash Lite Preview | 24.719% | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-28
52 | Grok 4.1 Fast Reasoning | 24.719% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
53 | DeepSeek V3P1 | 22.472% | - | Imported | 2026-05-28
54 | Gemini 2.5 Flash Preview 09 2025 Thinking | 21.348% | - | Imported | 2026-05-28
55 | Qwen 3 Max 2026-01-23 | 20.225% | - | Imported | 2026-05-28
56 | GPT Oss 120B | 19.101% | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28
57 | Trinity Large Thinking | 17.978% | Trinity Large Thinking (arcee-ai-trinity-large-thinking) | Imported | 2026-05-28
58 | Grok 4.1 Fast Non Reasoning | 17.978% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
59 | Command A Plus 05 2026 | 16.854% | - | Imported | 2026-05-28
60 | Mistral Small 2603 | 16.854% | Mistral: Mistral Small 4 (mistralai-mistral-small-2603) | Imported | 2026-05-28
61 | GPT 4.1 2025-04-14 | 14.607% | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-28
62 | Magistral Medium 2509 | 13.483% | - | Imported | 2026-05-28
63 | Mistral Large 2512 | 8.989% | Mistral: Mistral Large 3 2512 (mistralai-mistral-large-2512) | Imported | 2026-05-28
64 | Command A 03 2025 | 2.247% | Command A (cohere-command-a) | Imported | 2026-05-28
65 | Llama4 Maverick Instruct Basic | 2.247% | - | Imported | 2026-05-28

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Qwen3.7 Max | 69.7% | Qwen3.7 Max (qwen-qwen3.7-max) | Self-reported | 2026-05-28
2 | DeepSeek V4 Pro Max | 67.9% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Self-reported | 2026-05-28
3 | Kimi K2.6 Thinking | 66.7% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Self-reported | 2026-05-28
4 | Claude Opus 4.6 Max | 65.4% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-28
5 | GLM-5.1 Thinking | 63.5% | GLM 5.1 (z-ai-glm-5.1) | Self-reported | 2026-05-28
6 | Qwen3.6 Plus | 61.6% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-28

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | GPT-5.5 | 82.7% | GPT-5.5 (openai-gpt-5.5) | Launch post | 2026-04-23
2 | GPT-5.4 | 75.1% | GPT-5.4 (openai-gpt-5.4) | Launch post | 2026-04-23
3 | Claude Opus 4.7 | 69.4% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Launch post | 2026-04-23
4 | Gemini 3.1 Pro Preview | 68.5% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Launch post | 2026-04-23

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Mythos Preview | 82% | Claude Mythos Preview (anthropic-claude-mythos-preview) | Launch post | 2026-04-16
2 | GPT-5.4 | 75.1% | GPT-5.4 (openai-gpt-5.4) | Launch post | 2026-04-16
3 | Claude Opus 4.7 | 69.4% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Launch post | 2026-04-16
4 | Gemini 3.1 Pro Preview | 68.5% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Launch post | 2026-04-16
5 | Claude Opus 4.6 | 65.4% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Launch post | 2026-04-16