PostTrainBench

Benchmark measuring how well CLI agents can autonomously post-train small language models under a fixed budget of 10 hours on an H100 GPU.

Rows: 30
Primary metric: average_score
Sampled: 2026-05-06

Metadata

Metrics

Avg, Std. Dev. (lower is better), Runs, AIME 2025, Arena Hard, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Hours Spent (lower is better)

Latest Results

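Rows are ranked by average score, highest first. Baseline rows (Official Instruct Models and Base Models) are excluded from rank numbering, as in the source table; the number shown on a baseline row is its row position in this listing, not a rank.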

Rank  Subject                   Avg    Model Match  Provenance  Sampled
1     Opus 4.7                  28.56               Imported    2026-05-06
2     GPT 5.5                   25.02               Imported    2026-05-06
3     Opus 4.6 (1M)             24.82               Imported    2026-05-06
4     Opus 4.6                  23.16               Imported    2026-05-06
5     Gemini 3.1 Pro            21.59               Imported    2026-05-06
6     GPT-5.2                   21.38               Imported    2026-05-06
7     GPT 5.4                   20.23               Imported    2026-05-06
8     GPT 5.1 Codex Max         19.68               Imported    2026-05-06
9     Gemini 3 Pro              18.12               Imported    2026-05-06
10    GPT 5.3 Codex             17.76               Imported    2026-05-06
11    GPT 5.2 Codex             17.22               Imported    2026-05-06
12    Opus 4.5                  17.14               Imported    2026-05-06
13    GPT 5.3 Codex             13.77               Imported    2026-05-06
14    Official Instruct Models   7.30               Imported    2026-05-06
14    GPT 5.5                    4.05               Imported    2026-05-06
15    GPT 5.4                    4.03               Imported    2026-05-06
17    Base Models                2.58               Imported    2026-05-06
16    Opus 4.5                   2.47               Imported    2026-05-06
17    Sonnet 4.6                 2.34               Imported    2026-05-06
18    Gemini 3 Pro               2.13               Imported    2026-05-06
19    GLM 5                      1.98               Imported    2026-05-06
20    Kimi K2.5                  1.47               Imported    2026-05-06
21    Sonnet 4.5                 1.42               Imported    2026-05-06
22    MiniMax M2.5               1.35               Imported    2026-05-06
23    MiniMax M2.1               1.33               Imported    2026-05-06
24    GPT 5.1 Codex Max          1.09               Imported    2026-05-06
27    Base Models                1.08               Imported    2026-05-06
25    GLM 4.7                    1.07               Imported    2026-05-06
26    Qwen3 Max                  1.06               Imported    2026-05-06
27    Kimi K2 Thinking           1.04               Imported    2026-05-06
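
The ranking rule stated under Latest Results (sort by average score, highest first, and skip baseline rows when numbering ranks) can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the `Row` type, the `is_baseline` flag, and the `assign_ranks` function are hypothetical names introduced here. The sample rows are a four-row excerpt from the table, so ranks in this sketch start at 1 rather than at the excerpt's position in the full listing.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """One leaderboard row (hypothetical structure for illustration)."""
    subject: str
    avg: float
    is_baseline: bool = False  # assumed flag marking baseline rows


def assign_ranks(rows: List[Row]) -> List[Tuple[Optional[int], Row]]:
    """Sort rows by average score (descending) and number only non-baseline rows.

    Baseline rows keep their sorted position but receive no rank (None).
    """
    ordered = sorted(rows, key=lambda r: r.avg, reverse=True)
    ranked: List[Tuple[Optional[int], Row]] = []
    next_rank = 1
    for row in ordered:
        if row.is_baseline:
            ranked.append((None, row))
        else:
            ranked.append((next_rank, row))
            next_rank += 1
    return ranked


# Four-row excerpt taken from the table above.
rows = [
    Row("GPT 5.5", 4.05),
    Row("Official Instruct Models", 7.30, is_baseline=True),
    Row("Opus 4.5", 17.14),
    Row("GPT 5.3 Codex", 13.77),
]

for rank, row in assign_ranks(rows):
    label = "-" if rank is None else str(rank)
    print(f"{label:>2}  {row.subject:<26} {row.avg:>5.2f}")
```

In this sketch a baseline row is shown with "-" instead of a number, which avoids confusing its row position with a rank.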