Terminal Bench
Terminal and command-line interaction tasks for evaluating agent performance.
Rows: 9
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score (primary metric; higher ranks first), Standard error (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.2 | 64.90 | — | Imported | 2026-05-06 |
| 2 | Gemini 3 Pro | 64.30 | — | Imported | 2026-05-06 |
| 3 | Claude Opus 4.5 | 63.10 | — | Imported | 2026-05-06 |
| 4 | kimi-k2-thinking (official) | 35.70 | — | Imported | 2026-05-06 |
| 5 | Gemini 2.5 Pro (Jun 2025) | 32.60 | — | Imported | 2026-05-06 |
| 6 | Grok 4 | 27.20 | — | Imported | 2026-05-06 |
| 7 | Grok Code Fast 1 | 25.80 | — | Imported | 2026-05-06 |
| 8 | Qwen3-Max-Instruct | 25.40 | — | Imported | 2026-05-06 |
| 9 | GPT-OSS 120B | 18.70 | — | Imported | 2026-05-06 |