Terminal Bench

Terminal and command-line interaction tasks for evaluating agent performance.

Rows: 9
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better); standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | GPT-5.2 | 64.90 | | Imported | 2026-05-06 |
| 2 | Gemini 3 Pro | 64.30 | | Imported | 2026-05-06 |
| 3 | Claude Opus 4.5 | 63.10 | | Imported | 2026-05-06 |
| 4 | kimi-k2-thinking (official) | 35.70 | | Imported | 2026-05-06 |
| 5 | Gemini 2.5 Pro (Jun 2025) | 32.60 | | Imported | 2026-05-06 |
| 6 | Grok 4 | 27.20 | | Imported | 2026-05-06 |
| 7 | Grok Code Fast 1 | 25.80 | | Imported | 2026-05-06 |
| 8 | Qwen3-Max-Instruct | 25.40 | | Imported | 2026-05-06 |
| 9 | GPT-OSS 120B | 18.70 | | Imported | 2026-05-06 |
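
For readers who want to work with these results programmatically, here is a minimal sketch of parsing one leaderboard row from a whitespace-separated text export, where each line holds a rank, a subject name (which may itself contain spaces and digits), a score, a provenance label, and a sampled date. The names `ROW_PATTERN`, `LeaderboardRow`, and `parse_row` are hypothetical and chosen for illustration. This is not the importer used to build this page, and the input format is an assumption.

```python
import re
from dataclasses import dataclass

# Assumed line layout: "<rank> <subject> <score> <provenance> <YYYY-MM-DD>".
# The subject is matched lazily so that names containing spaces or digits
# (e.g. "Grok Code Fast 1") are kept whole.
ROW_PATTERN = re.compile(
    r"^(?P<rank>\d+)\s+(?P<subject>.+?)\s+(?P<score>\d+\.\d+)\s+"
    r"(?P<provenance>\S+)\s+(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


@dataclass
class LeaderboardRow:
    rank: int
    subject: str
    score: float
    provenance: str
    sampled: str


def parse_row(line: str) -> LeaderboardRow:
    """Parse one whitespace-separated leaderboard line (hypothetical format)."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return LeaderboardRow(
        rank=int(match["rank"]),
        subject=match["subject"],
        score=float(match["score"]),
        provenance=match["provenance"],
        sampled=match["sampled"],
    )


# Sample lines taken from the results listed on this page.
sample_lines = [
    "1 GPT-5.2 64.90 Imported 2026-05-06",
    "5 Gemini 2.5 Pro (Jun 2025) 32.60 Imported 2026-05-06",
    "7 Grok Code Fast 1 25.80 Imported 2026-05-06",
]
rows = [parse_row(line) for line in sample_lines]

# Rows sorted by score, highest first, should come back in rank order.
by_score = sorted(rows, key=lambda row: row.score, reverse=True)
```

The lazy subject match combined with the end-of-line anchor is what keeps a version number such as the "2.5" in "Gemini 2.5 Pro (Jun 2025)" from being mistaken for the score. A stricter parser might instead work from the original table cells, where column boundaries are explicit.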