Terminal-Bench 2.0 | BenchmarkList

Metadata

ID: vals_terminal_bench_2
Category: Coding
Release: 2026-01-17
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Latest Results

Showing 4 latest source slices.

Rank	Subject	Score	Model Match	Provenance	Sampled
1	GPT 5.5	73.202%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-28
2	Claude Opus 4.8	70.037%	Claude Opus 4.8 anthropic-claude-opus-4.8	Imported	2026-05-28
3	Claude Opus 4.7	68.539%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-28
4	Gemini 3.1 Pro Preview	67.416%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-28
5	Gemini 3.5 Flash	67.416%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-28
6	GPT 5.3 Codex	64.045%	GPT-5.3-Codex openai-gpt-5.3-codex	Imported	2026-05-28
7	Claude Sonnet 4.6	59.551%	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Imported	2026-05-28
8	Muse Spark	59.551%	—	Imported	2026-05-28
9	Qwen 3.7 Max	59.176%	Qwen3.7 Max qwen-qwen3.7-max	Imported	2026-05-28
10	Claude Opus 4.5 20251101	58.427%	Claude Opus 4.5 anthropic-claude-opus-4.5	Imported	2026-05-28
11	Claude Opus 4.6 Thinking	58.427%	Claude Opus 4.6 anthropic-claude-opus-4.6	Imported	2026-05-28
12	GPT 5.4 2026-03-05	58.427%	GPT-5.4 openai-gpt-5.4	Imported	2026-05-28
13	Kimi K2.6 Thinking	57.303%	MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-28
14	DeepSeek V4 Pro	56.18%	DeepSeek V4 Pro deepseek-deepseek-v4-pro	Imported	2026-05-28
15	Gemini 3 Pro Preview	55.056%	Gemini 3 google-gemini-3	Imported	2026-05-28
16	Claude Opus 4.5 20251101 Thinking	53.933%	Claude Opus 4.5 anthropic-claude-opus-4.5	Imported	2026-05-28
17	GLM 5.1 Thinking	53.933%	GLM 5.1 z-ai-glm-5.1	Imported	2026-05-28
18	Qwen 3.6 Max Preview	51.685%	Qwen3.6 Max Preview qwen-qwen3.6-max-preview	Imported	2026-05-28
19	Gemini 3 Flash Preview	51.685%	Gemini 3 Flash Preview google-gemini-3-flash-preview	Imported	2026-05-28
20	GPT 5.2 2025-12-11	51.685%	GPT-5.2 openai-gpt-5.2	Imported	2026-05-28
21	GLM 5 Thinking	49.438%	GLM 5 z-ai-glm-5	Imported	2026-05-28
22	MiniMax M2.7	47.191%	MiniMax M2.7 minimax-minimax-m2.7	Imported	2026-05-28
23	Qwen 3.6 27B	44.944%	Qwen3.6 27B qwen-qwen3.6-27b	Imported	2026-05-28
24	Qwen 3.6 Plus	44.944%	Qwen3.6 Plus qwen-qwen3.6-plus	Imported	2026-05-28
25	GPT 5.1 2025-11-13	44.944%	GPT-5.1 openai-gpt-5.1	Imported	2026-05-28
26	GPT 5.4 Mini 2026-03-17	44.944%	GPT-5.4 Mini openai-gpt-5.4-mini	Imported	2026-05-28
27	Grok 4.3	43.446%	Grok 4.3 x-ai-grok-4.3	Imported	2026-05-28
28	Qwen 3.5 Plus Thinking	41.573%	—	Imported	2026-05-28
29	Claude Sonnet 4.5 20250929 Thinking	41.573%	Claude Sonnet 4.5 anthropic-claude-sonnet-4.5	Imported	2026-05-28
30	MiniMax M2.5 Lightning	41.573%	—	Imported	2026-05-28
31	Grok 4.20 0309 Reasoning	40.449%	Grok 4.20 x-ai-grok-4.20	Imported	2026-05-28
32	Kimi K2.5 Thinking	40.449%	MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5	Imported	2026-05-28
33	GPT 5.4 Nano 2026-03-17	39.888%	GPT-5.4 Nano openai-gpt-5.4-nano	Imported	2026-05-28
34	Gemma 4 31B It	39.326%	Gemma 4 31B google-gemma-4-31b-it	Imported	2026-05-28
35	Claude Haiku 4.5 20251001 Thinking	38.202%	Claude Haiku 4.5 anthropic-claude-haiku-4.5	Imported	2026-05-28
36	GLM 4.7	38.202%	GLM 4.7 z-ai-glm-4.7	Imported	2026-05-28
37	Kimi K2 Thinking	37.079%	MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking	Imported	2026-05-28
38	MiniMax M2.1	37.079%	MiniMax M2.1 minimax-minimax-m2.1	Imported	2026-05-28
39	GPT 5 2025-08-07	37.079%	GPT-5 openai-gpt-5	Imported	2026-05-28
40	DeepSeek V3P2 Thinking	35.955%	—	Imported	2026-05-28
41	DeepSeek V3P2	34.831%	—	Imported	2026-05-28
42	Gemini 2.5 Pro	30.337%	Gemini 2.5 Pro google-gemini-2.5-pro	Imported	2026-05-28
43	Mistral Medium 3.5	30.337%	Mistral: Mistral Medium 3.5 mistralai-mistral-medium-3-5	Imported	2026-05-28
44	Grok 4 Fast Reasoning	29.213%	Grok 4 Fast x-ai-grok-4-fast	Imported	2026-05-28
45	Grok 4 0709	28.09%	Grok 4 x-ai-grok-4	Imported	2026-05-28
46	GLM 4.6	28.09%	GLM 4.6 z-ai-glm-4.6	Imported	2026-05-28
47	GPT 5 Mini 2025-08-07	26.966%	GPT-5 Mini openai-gpt-5-mini	Imported	2026-05-28
48	Kimi K2 Instruct	25.843%	MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2	Imported	2026-05-28
49	Qwen 3 Max	24.719%	Qwen3 Max qwen-qwen3-max	Imported	2026-05-28
50	Qwen 3.5 Flash	24.719%	Qwen3.5-Flash qwen-qwen3.5-flash-02-23	Imported	2026-05-28
51	Gemini 3.1 Flash Lite Preview	24.719%	Gemini 3.1 Flash Lite Preview google-gemini-3.1-flash-lite-preview	Imported	2026-05-28
52	Grok 4.1 Fast Reasoning	24.719%	Grok 4.1 Fast x-ai-grok-4.1-fast	Imported	2026-05-28
53	DeepSeek V3P1	22.472%	—	Imported	2026-05-28
54	Gemini 2.5 Flash Preview 09 2025 Thinking	21.348%	—	Imported	2026-05-28
55	Qwen 3 Max 2026-01-23	20.225%	—	Imported	2026-05-28
56	GPT Oss 120B	19.101%	gpt-oss-120b openai-gpt-oss-120b	Imported	2026-05-28
57	Trinity Large Thinking	17.978%	Trinity Large Thinking arcee-ai-trinity-large-thinking	Imported	2026-05-28
58	Grok 4.1 Fast Non Reasoning	17.978%	Grok 4.1 Fast x-ai-grok-4.1-fast	Imported	2026-05-28
59	Command A Plus 05 2026	16.854%	—	Imported	2026-05-28
60	Mistral Small 2603	16.854%	Mistral: Mistral Small 4 mistralai-mistral-small-2603	Imported	2026-05-28
61	GPT 4.1 2025-04-14	14.607%	GPT-4.1 openai-gpt-4.1	Imported	2026-05-28
62	Magistral Medium 2509	13.483%	—	Imported	2026-05-28
63	Mistral Large 2512	8.989%	Mistral: Mistral Large 3 2512 mistralai-mistral-large-2512	Imported	2026-05-28
64	Command A 03 2025	2.247%	Command A cohere-command-a	Imported	2026-05-28
65	Llama4 Maverick Instruct Basic	2.247%	—	Imported	2026-05-28
1	Qwen3.7 Max	69.7%	Qwen3.7 Max qwen-qwen3.7-max	Self-reported	2026-05-28
2	DeepSeek V4 Pro Max	67.9%	DeepSeek V4 Pro deepseek-deepseek-v4-pro	Self-reported	2026-05-28
3	Kimi K2.6 Thinking	66.7%	MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Self-reported	2026-05-28
4	Claude Opus 4.6 Max	65.4%	Claude Opus 4.6 anthropic-claude-opus-4.6	Self-reported	2026-05-28
5	GLM-5.1 Thinking	63.5%	GLM 5.1 z-ai-glm-5.1	Self-reported	2026-05-28
6	Qwen3.6 Plus	61.6%	Qwen3.6 Plus qwen-qwen3.6-plus	Self-reported	2026-05-28
1	GPT-5.5	82.7%	GPT-5.5 openai-gpt-5.5	Launch post	2026-04-23
2	GPT-5.4	75.1%	GPT-5.4 openai-gpt-5.4	Launch post	2026-04-23
3	Claude Opus 4.7	69.4%	Claude Opus 4.7 anthropic-claude-opus-4.7	Launch post	2026-04-23
4	Gemini 3.1 Pro Preview	68.5%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Launch post	2026-04-23
1	Claude Mythos Preview	82%	Claude Mythos Preview anthropic-claude-mythos-preview	Launch post	2026-04-16
2	GPT-5.4	75.1%	GPT-5.4 openai-gpt-5.4	Launch post	2026-04-16
3	Claude Opus 4.7	69.4%	Claude Opus 4.7 anthropic-claude-opus-4.7	Launch post	2026-04-16
4	Gemini 3.1 Pro Preview	68.5%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Launch post	2026-04-16
5	Claude Opus 4.6	65.4%	Claude Opus 4.6 anthropic-claude-opus-4.6	Launch post	2026-04-16
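
The table above stacks several source slices in one listing, with Rank restarting at 1 whenever the Provenance or Sampled value changes. As a rough sketch of how the rows could be separated back into slices, the Python below groups tab-separated rows by the (Provenance, Sampled) pair. The function name `split_slices`, the dictionary keys, and the choice of grouping key are assumptions made for illustration, not part of the listing; only a handful of rows are copied in as sample input.

```python
from collections import defaultdict

# Placeholder used in the Model Match column when no match is listed
# (an em dash in the table above).
MISSING = "\u2014"

# A few rows copied from the table above, tab-separated, same column order:
# Rank, Subject, Score, Model Match, Provenance, Sampled.
ROWS = (
    "1\tGPT 5.5\t73.202%\tGPT-5.5 openai-gpt-5.5\tImported\t2026-05-28\n"
    "8\tMuse Spark\t59.551%\t\u2014\tImported\t2026-05-28\n"
    "1\tQwen3.7 Max\t69.7%\tQwen3.7 Max qwen-qwen3.7-max\tSelf-reported\t2026-05-28\n"
    "1\tGPT-5.5\t82.7%\tGPT-5.5 openai-gpt-5.5\tLaunch post\t2026-04-23\n"
    "1\tClaude Mythos Preview\t82%\tClaude Mythos Preview "
    "anthropic-claude-mythos-preview\tLaunch post\t2026-04-16\n"
)


def split_slices(text):
    """Group rows by (provenance, sampled date); hypothetical helper."""
    slices = defaultdict(list)
    for line in text.strip().splitlines():
        rank, subject, score, match, provenance, sampled = line.split("\t")
        slices[(provenance, sampled)].append(
            {
                "rank": int(rank),
                "subject": subject,
                # Scores are percentages; strip the sign and keep the number.
                "score": float(score.rstrip("%")),
                "model_match": None if match == MISSING else match,
            }
        )
    return dict(slices)


slices = split_slices(ROWS)
for key, rows in slices.items():
    print(key, [row["subject"] for row in rows])
```

Grouping on both columns matters here because the two launch-post slices share a Provenance value and differ only in their Sampled date.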
