PostTrainBench
Benchmark measuring how well CLI agents can autonomously post-train small language models under a fixed budget of an H100 GPU and 10 hours.
Rows: 30
Primary metric: average_score
Sampled: 2026-05-06
Metrics
Avg, Std. Dev. (lower is better), Runs, AIME 2025, Arena Hard, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Hours Spent (lower is better)
| Rank | Subject | Avg | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Opus 4.7 | 28.56 | — | Imported | 2026-05-06 |
| 2 | GPT 5.5 | 25.02 | — | Imported | 2026-05-06 |
| 3 | Opus 4.6 (1M) | 24.82 | — | Imported | 2026-05-06 |
| 4 | Opus 4.6 | 23.16 | — | Imported | 2026-05-06 |
| 5 | Gemini 3.1 Pro | 21.59 | — | Imported | 2026-05-06 |
| 6 | GPT-5.2 | 21.38 | — | Imported | 2026-05-06 |
| 7 | GPT 5.4 | 20.23 | — | Imported | 2026-05-06 |
| 8 | GPT 5.1 Codex Max | 19.68 | — | Imported | 2026-05-06 |
| 9 | Gemini 3 Pro | 18.12 | — | Imported | 2026-05-06 |
| 10 | GPT 5.3 Codex | 17.76 | — | Imported | 2026-05-06 |
| 11 | GPT 5.2 Codex | 17.22 | — | Imported | 2026-05-06 |
| 12 | Opus 4.5 | 17.14 | — | Imported | 2026-05-06 |
| 13 | GPT 5.3 Codex | 13.77 | — | Imported | 2026-05-06 |
| 14 | Official Instruct Models | 7.30 | — | Imported | 2026-05-06 |
| 14 | GPT 5.5 | 4.05 | — | Imported | 2026-05-06 |
| 15 | GPT 5.4 | 4.03 | — | Imported | 2026-05-06 |
| 17 | Base Models | 2.58 | — | Imported | 2026-05-06 |
| 16 | Opus 4.5 | 2.47 | — | Imported | 2026-05-06 |
| 17 | Sonnet 4.6 | 2.34 | — | Imported | 2026-05-06 |
| 18 | Gemini 3 Pro | 2.13 | — | Imported | 2026-05-06 |
| 19 | GLM 5 | 1.98 | — | Imported | 2026-05-06 |
| 20 | Kimi K2.5 | 1.47 | — | Imported | 2026-05-06 |
| 21 | Sonnet 4.5 | 1.42 | — | Imported | 2026-05-06 |
| 22 | MiniMax M2.5 | 1.35 | — | Imported | 2026-05-06 |
| 23 | MiniMax M2.1 | 1.33 | — | Imported | 2026-05-06 |
| 24 | GPT 5.1 Codex Max | 1.09 | — | Imported | 2026-05-06 |
| 27 | Base Models | 1.08 | — | Imported | 2026-05-06 |
| 25 | GLM 4.7 | 1.07 | — | Imported | 2026-05-06 |
| 26 | Qwen3 Max | 1.06 | — | Imported | 2026-05-06 |
| 27 | Kimi K2 Thinking | 1.04 | — | Imported | 2026-05-06 |