Terminal-Bench 2.1 | BenchmarkList

Metadata

ID: vals_terminal_bench_2_1
Category: Coding
Release: Unknown

Metrics

Score (higher is better), Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Showing 2 latest source slices.

Rank	Subject	Score	Model Match	Provenance	Sampled
1	GPT 5.5	76.404%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-28
2	Gemini 3.5 Flash	74.157%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-28
3	Claude Opus 4.8	71.910%	Claude Opus 4.8 anthropic-claude-opus-4.8	Imported	2026-05-28
4	Gemini 3.1 Pro Preview	70.787%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-28
5	Claude Opus 4.7	68.539%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-28
6	Qwen 3.7 Max	61.049%	Qwen3.7 Max qwen-qwen3.7-max	Imported	2026-05-28
7	GLM 5.1 Thinking	56.929%	GLM 5.1 z-ai-glm-5.1	Imported	2026-05-28
8	Gemini 3 Flash Preview	53.933%	Gemini 3 Flash Preview google-gemini-3-flash-preview	Imported	2026-05-28
9	Kimi K2.6 Thinking	53.558%	MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-28
10	Qwen 3.6 Plus	53.184%	Qwen3.6 Plus qwen-qwen3.6-plus	Imported	2026-05-28
11	DeepSeek V4 Pro	50.187%	DeepSeek V4 Pro deepseek-deepseek-v4-pro	Imported	2026-05-28
12	MiniMax M2.7	48.689%	MiniMax M2.7 minimax-minimax-m2.7	Imported	2026-05-28
13	Grok 4.20 0309 Reasoning	44.195%	Grok 4.20 x-ai-grok-4.20	Imported	2026-05-28
14	Claude Haiku 4.5 20251001 Thinking	43.820%	Claude Haiku 4.5 anthropic-claude-haiku-4.5	Imported	2026-05-28
15	Mistral Medium 3.5	38.951%	Mistral: Mistral Medium 3.5 mistralai-mistral-medium-3-5	Imported	2026-05-28
16	Command A Plus 05 2026	17.603%	—	Imported	2026-05-28

Rank	Subject	Score	Model Match	Provenance	Sampled
1	GPT-5.5	78.2%	GPT-5.5 openai-gpt-5.5	Self-reported	2026-05-28
2	Claude Opus 4.8	74.6%	Claude Opus 4.8 anthropic-claude-opus-4.8	Self-reported	2026-05-28
3	Gemini 3.1 Pro Preview	70.3%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Self-reported	2026-05-28
4	Claude Opus 4.7	66.1%	Claude Opus 4.7 anthropic-claude-opus-4.7	Self-reported	2026-05-28
