SLDBench

SLDBench (Scaling Law Discovery benchmark) evaluates AI agents on discovering mathematical scaling laws from experimental LLM training data.

Rows: 27
Primary metric: mean_reward_r2
Sampled: 2026-05-06

Metadata

Metrics

Mean Reward R2, Task Count, Data Constrained Scaling Law Reward R2, Domain Mixture Scaling Law Reward R2, Easy Question Scaling Law Reward R2, Lr Bsz Scaling Law Reward R2, Moe Scaling Law Reward R2, Parallel Scaling Law Reward R2, Sft Scaling Law Reward R2, Vocab Scaling Law Reward R2

Latest Results

Rows are parsed from the public Hugging Face dataset-server rows API. Results are aggregated by source agent_name and model_name using the mean R2 across available task splits, with each split metric preserved.
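The aggregation described above can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: only `agent_name` and `model_name` come from the description, while the `task` and `reward_r2` field names, the `aggregate_rows` helper, and the output keys are assumptions made for the example. It assumes each input row carries one task split's score for one agent and model pair.

```python
from collections import defaultdict
from statistics import mean


def aggregate_rows(rows):
    """Group rows by (agent_name, model_name) and average R2 over available splits.

    `task` and `reward_r2` are hypothetical field names chosen for this sketch.
    """
    per_subject = defaultdict(dict)
    for row in rows:
        value = row.get("reward_r2")
        if value is None:
            continue  # a split with no score is simply not "available"
        key = (row["agent_name"], row["model_name"])
        per_subject[key].setdefault(row["task"], []).append(value)

    results = []
    for (agent, model), splits in per_subject.items():
        # Keep every split metric alongside the aggregate.
        split_means = {task: mean(vals) for task, vals in splits.items()}
        results.append({
            "subject": f"{agent} + {model}",
            "mean_reward_r2": mean(list(split_means.values())),
            "task_count": len(split_means),
            "splits": split_means,
        })

    # Highest mean R2 first, matching the leaderboard ordering.
    results.sort(key=lambda r: r["mean_reward_r2"], reverse=True)
    return results
```

Averaging per split first and then across splits keeps a subject with repeated runs on one task from weighting that task more heavily; whether the real aggregation does this is not stated in the source.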

| Rank | Subject | Mean Reward R2 | Model Match | Provenance | Sampled |
|------|---------|----------------|-------------|------------|---------|
| 1 | SLDAgent + o4-mini | 0.62 | | Imported | 2026-05-06 |
| 2 | SLDAgent + gpt-5 | 0.62 | | Imported | 2026-05-06 |
| 3 | SLDAgent + gemini-3-pro-preview | 0.59 | | Imported | 2026-05-06 |
| 4 | human + human | 0.52 | | Imported | 2026-05-06 |
| 5 | goose + gpt-5 | 0.51 | | Imported | 2026-05-06 |
| 6 | gemini-cli + gemini-3-pro-preview | 0.48 | | Imported | 2026-05-06 |
| 7 | SLDAgent + claude-sonnet-4-5-20250929 | 0.46 | | Imported | 2026-05-06 |
| 8 | SLDAgent + gemini-2.5-flash | 0.45 | | Imported | 2026-05-06 |
| 9 | SLDAgent + claude-haiku-4-5-20251001 | 0.43 | | Imported | 2026-05-06 |
| 10 | claude-code + claude-sonnet-4-5 | 0.42 | | Imported | 2026-05-06 |
| 11 | codex + o4-mini | 0.37 | | Imported | 2026-05-06 |
| 12 | openhands + gpt-5 | 0.34 | | Imported | 2026-05-06 |
| 13 | opencode + gpt-5 | 0.28 | | Imported | 2026-05-06 |
| 14 | claude-code + claude-haiku-4-5 | 0.22 | | Imported | 2026-05-06 |
| 15 | codex + gpt-5 | 0.22 | | Imported | 2026-05-06 |
| 16 | openhands + gpt-5.2 | 0.12 | | Imported | 2026-05-06 |
| 17 | openhands + gpt-4.1 | 0.11 | | Imported | 2026-05-06 |
| 18 | mini-swe-agent + gpt-5 | 0.02 | | Imported | 2026-05-06 |
| 19 | openhands + gemini-3-flash-preview | -0.03 | | Imported | 2026-05-06 |
| 20 | openhands + o4-mini | -0.26 | | Imported | 2026-05-06 |
| 21 | openhands + o3 | -0.27 | | Imported | 2026-05-06 |
| 22 | terminus-2 + gpt-5 | -0.48 | | Imported | 2026-05-06 |
| 23 | openhands + DeepSeek-V3.2-reasoning | -0.50 | | Imported | 2026-05-06 |
| 24 | openhands + DeepSeek-V3.2 | -0.56 | | Imported | 2026-05-06 |
| 25 | aider + gpt-5 | -0.62 | | Imported | 2026-05-06 |
| 26 | gemini-cli + gemini-2.5-flash | -0.75 | | Imported | 2026-05-06 |
| 27 | openhands + gpt-4o | -1 | | Imported | 2026-05-06 |