AIRS-Bench
AI Research Science benchmark evaluating autonomous ML research agents across 20 tasks sourced from state-of-the-art papers in NLP, code, math, biochemical modelling, and time-series forecasting.
14 rows
Primary metric: avg_normalized_score
Sampled: 2026-05-06
Metadata
Metrics
Avg. norm. score, Avg. norm. score std (lower is better), # seeds
| Rank | Subject | Avg. norm. score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Greedy gpt-oss-120b | 0.40 | — | Imported | 2026-05-06 |
| 2 | Greedy gpt-oss-20b | 0.40 | — | Imported | 2026-05-06 |
| 3 | Greedy o3-mini | 0.39 | — | Imported | 2026-05-06 |
| 4 | Greedy GPT-4o | 0.31 | — | Imported | 2026-05-06 |
| 5 | MLGym CWM | 0.30 | — | Imported | 2026-05-06 |
| 6 | Greedy CWM | 0.29 | — | Imported | 2026-05-06 |
| 7 | Greedy Devstral | 0.18 | — | Imported | 2026-05-06 |
| 8 | MLGym GPT-4o | 0.18 | — | Imported | 2026-05-06 |
| 9 | One-Shot o3-mini | 0.17 | — | Imported | 2026-05-06 |
| 10 | One-Shot gpt-oss-120b | 0.16 | — | Imported | 2026-05-06 |
| 11 | One-Shot gpt-oss-20b | 0.08 | — | Imported | 2026-05-06 |
| 12 | One-Shot GPT-4o | 0.06 | — | Imported | 2026-05-06 |
| 13 | One-Shot CWM | 0.04 | — | Imported | 2026-05-06 |
| 14 | One-Shot Devstral | 0.02 | — | Imported | 2026-05-06 |
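
Each Subject label in the table appears to combine an agent scaffold (Greedy, MLGym, One-Shot) with a model name. The sketch below is one way to regroup the rows on that basis and find the highest-scoring scaffold for each model. The split rule (first word is the scaffold, the rest is the model) is an assumption read off the labels, not something the benchmark states. The helper names `split_subject`, `best_per_model`, and `scores_by_scaffold` are hypothetical.

```python
from collections import defaultdict

# (subject, avg. norm. score) pairs copied from the leaderboard table.
ROWS = [
    ("Greedy gpt-oss-120b", 0.40),
    ("Greedy gpt-oss-20b", 0.40),
    ("Greedy o3-mini", 0.39),
    ("Greedy GPT-4o", 0.31),
    ("MLGym CWM", 0.30),
    ("Greedy CWM", 0.29),
    ("Greedy Devstral", 0.18),
    ("MLGym GPT-4o", 0.18),
    ("One-Shot o3-mini", 0.17),
    ("One-Shot gpt-oss-120b", 0.16),
    ("One-Shot gpt-oss-20b", 0.08),
    ("One-Shot GPT-4o", 0.06),
    ("One-Shot CWM", 0.04),
    ("One-Shot Devstral", 0.02),
]


def split_subject(subject):
    """Split a Subject label into (scaffold, model).

    Assumption: the first whitespace-separated token is the scaffold
    and the remainder is the model name.
    """
    scaffold, model = subject.split(" ", 1)
    return scaffold, model


def best_per_model(rows):
    """Return {model: (scaffold, score)} keeping the top score per model."""
    best = {}
    for subject, score in rows:
        scaffold, model = split_subject(subject)
        if model not in best or score > best[model][1]:
            best[model] = (scaffold, score)
    return best


def scores_by_scaffold(rows):
    """Return {scaffold: [scores]} in table order."""
    grouped = defaultdict(list)
    for subject, score in rows:
        scaffold, _ = split_subject(subject)
        grouped[scaffold].append(score)
    return dict(grouped)


if __name__ == "__main__":
    for model, (scaffold, score) in sorted(best_per_model(ROWS).items()):
        print(f"{model}: best scaffold {scaffold} ({score:.2f})")
```

The scores here are the rounded values shown in the table, so ties such as the two 0.40 entries cannot be broken from this data alone.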