SLDBench
Scaling Law Discovery benchmark evaluating AI agents on discovering mathematical scaling laws from experimental LLM training data.
27 rows
Primary metric: mean_reward_r2
Sampled: 2026-05-06
Metadata
Metrics
Mean Reward R2, Task Count, Data Constrained Scaling Law Reward R2, Domain Mixture Scaling Law Reward R2, Easy Question Scaling Law Reward R2, LR Bsz Scaling Law Reward R2, MoE Scaling Law Reward R2, Parallel Scaling Law Reward R2, SFT Scaling Law Reward R2, Vocab Scaling Law Reward R2
| Rank | Subject | Mean Reward R2 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | SLDAgent + o4-mini | 0.62 | — | Imported | 2026-05-06 |
| 2 | SLDAgent + gpt-5 | 0.62 | — | Imported | 2026-05-06 |
| 3 | SLDAgent + gemini-3-pro-preview | 0.59 | — | Imported | 2026-05-06 |
| 4 | human + human | 0.52 | — | Imported | 2026-05-06 |
| 5 | goose + gpt-5 | 0.51 | — | Imported | 2026-05-06 |
| 6 | gemini-cli + gemini-3-pro-preview | 0.48 | — | Imported | 2026-05-06 |
| 7 | SLDAgent + claude-sonnet-4-5-20250929 | 0.46 | — | Imported | 2026-05-06 |
| 8 | SLDAgent + gemini-2.5-flash | 0.45 | — | Imported | 2026-05-06 |
| 9 | SLDAgent + claude-haiku-4-5-20251001 | 0.43 | — | Imported | 2026-05-06 |
| 10 | claude-code + claude-sonnet-4-5 | 0.42 | — | Imported | 2026-05-06 |
| 11 | codex + o4-mini | 0.37 | — | Imported | 2026-05-06 |
| 12 | openhands + gpt-5 | 0.34 | — | Imported | 2026-05-06 |
| 13 | opencode + gpt-5 | 0.28 | — | Imported | 2026-05-06 |
| 14 | claude-code + claude-haiku-4-5 | 0.22 | — | Imported | 2026-05-06 |
| 15 | codex + gpt-5 | 0.22 | — | Imported | 2026-05-06 |
| 16 | openhands + gpt-5.2 | 0.12 | — | Imported | 2026-05-06 |
| 17 | openhands + gpt-4.1 | 0.11 | — | Imported | 2026-05-06 |
| 18 | mini-swe-agent + gpt-5 | 0.02 | — | Imported | 2026-05-06 |
| 19 | openhands + gemini-3-flash-preview | -0.03 | — | Imported | 2026-05-06 |
| 20 | openhands + o4-mini | -0.26 | — | Imported | 2026-05-06 |
| 21 | openhands + o3 | -0.27 | — | Imported | 2026-05-06 |
| 22 | terminus-2 + gpt-5 | -0.48 | — | Imported | 2026-05-06 |
| 23 | openhands + DeepSeek-V3.2-reasoning | -0.50 | — | Imported | 2026-05-06 |
| 24 | openhands + DeepSeek-V3.2 | -0.56 | — | Imported | 2026-05-06 |
| 25 | aider + gpt-5 | -0.62 | — | Imported | 2026-05-06 |
| 26 | gemini-cli + gemini-2.5-flash | -0.75 | — | Imported | 2026-05-06 |
| 27 | openhands + gpt-4o | -1.00 | — | Imported | 2026-05-06 |
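
Several rows above report a negative Mean Reward R2. As a rough illustration of how that can happen, here is a minimal sketch that assumes the reward is the standard coefficient of determination, averaged with equal weight over tasks. The source does not state the exact reward definition or whether scores are clipped, so the functions `r_squared` and `mean_reward` and the task names below are hypothetical.

```python
from statistics import fmean


def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def mean_reward(per_task_r2):
    """Unweighted mean over tasks (an assumption, not the benchmark's stated rule)."""
    return fmean(per_task_r2.values())


observed = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])   # law fits exactly
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
inverted = r_squared(observed, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

# Hypothetical task names, used only for illustration.
scores = {"task_a": perfect, "task_b": baseline, "task_c": inverted}
print(perfect, baseline, inverted, mean_reward(scores))
```

Under this definition, a law that predicts worse than a constant mean scores below zero, so a single badly extrapolating task can pull an agent's average negative.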