Terminal-Bench 2.1

State-of-the-art set of difficult terminal-based tasks

Rows: 20
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Showing 2 latest source slices.

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. The primary score is the overall accuracy percentage.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.5 | 76.404% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28 |
| 2 | Gemini 3.5 Flash | 74.157% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28 |
| 3 | Claude Opus 4.8 | 71.91% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 70.787% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28 |
| 5 | Claude Opus 4.7 | 68.539% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28 |
| 6 | Qwen 3.7 Max | 61.049% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28 |
| 7 | GLM 5.1 Thinking | 56.929% | GLM GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28 |
| 8 | Gemini 3 Flash Preview | 53.933% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28 |
| 9 | Kimi K2.6 Thinking | 53.558% | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28 |
| 10 | Qwen 3.6 Plus | 53.184% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28 |
| 11 | DeepSeek V4 Pro | 50.187% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28 |
| 12 | MiniMax M2.7 | 48.689% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28 |
| 13 | Grok 4.20 0309 Reasoning | 44.195% | GROK Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28 |
| 14 | Claude Haiku 4.5 20251001 Thinking | 43.82% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28 |
| 15 | Mistral Medium 3.5 | 38.951% | Mistral: Mistral Medium 3.5 (mistralai-mistral-medium-3-5) | Imported | 2026-05-28 |
| 16 | Command A Plus 05 2026 | 17.603% | | Imported | 2026-05-28 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 78.2% | GPT-5.5 (openai-gpt-5.5) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 74.6% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
| 3 | Gemini 3.1 Pro Preview | 70.3% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 66.1% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |
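
Four models appear in both source slices, so their imported and self-reported scores can be compared directly. The following is a minimal sketch of that arithmetic, not part of the leaderboard itself. The scores are copied from the rows above; the dictionary layout and the `score_gaps` helper are illustrative assumptions.

```python
# Scores (percent) copied from the leaderboard rows, keyed by model slug.
# Only the four models present in both slices are listed.
imported = {
    "openai-gpt-5.5": 76.404,
    "anthropic-claude-opus-4.8": 71.91,
    "google-gemini-3.1-pro-preview": 70.787,
    "anthropic-claude-opus-4.7": 68.539,
}
self_reported = {
    "openai-gpt-5.5": 78.2,
    "anthropic-claude-opus-4.8": 74.6,
    "google-gemini-3.1-pro-preview": 70.3,
    "anthropic-claude-opus-4.7": 66.1,
}


def score_gaps(imported, self_reported):
    """Hypothetical helper: self-reported minus imported score, in
    percentage points, for every model present in both slices."""
    return {
        slug: round(self_reported[slug] - imported[slug], 3)
        for slug in self_reported
        if slug in imported
    }


gaps = score_gaps(imported, self_reported)
for slug, gap in gaps.items():
    print(f"{slug}: {gap:+.3f} pp")
```

A positive gap means the self-reported score is higher than the imported one. The gap is positive for GPT-5.5 and Claude Opus 4.8 and negative for Gemini 3.1 Pro Preview and Claude Opus 4.7. The listing does not say whether the two slices used the same evaluation setup, so these differences may not reflect like-for-like runs.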