AIRS-Bench

AI Research Science benchmark evaluating autonomous ML research agents across 20 tasks sourced from state-of-the-art papers in NLP, code, math, biochemical modelling, and time-series forecasting.

Rows: 14
Primary metric: avg_normalized_score
Sampled: 2026-05-06

Metadata

Metrics

Avg. norm. score, Avg. norm. score std (lower is better), # seeds

Latest Results

Rows are parsed from the public AIRS-Bench README leaderboard table. Source agent display names are preserved.

Rank  Subject               Avg. norm. score  Model Match  Provenance  Sampled
1     Greedy gpt-oss-120b   0.40                           Imported    2026-05-06
2     Greedy gpt-oss-20b    0.40                           Imported    2026-05-06
3     Greedy o3-mini        0.39                           Imported    2026-05-06
4     Greedy GPT-4o         0.31                           Imported    2026-05-06
5     MLGym CWM             0.30                           Imported    2026-05-06
6     Greedy CWM            0.29                           Imported    2026-05-06
7     Greedy Devstral       0.18                           Imported    2026-05-06
8     MLGym GPT-4o          0.18                           Imported    2026-05-06
9     One-Shot o3-mini      0.17                           Imported    2026-05-06
10    One-Shot gpt-oss-120b 0.16                           Imported    2026-05-06
11    One-Shot gpt-oss-20b  0.08                           Imported    2026-05-06
12    One-Shot GPT-4o       0.06                           Imported    2026-05-06
13    One-Shot CWM          0.04                           Imported    2026-05-06
14    One-Shot Devstral     0.02                           Imported    2026-05-06
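
The note above says rows are parsed from the README leaderboard table. The following is a minimal sketch of how such an import could work, not the actual importer. It assumes the README table is a pipe-delimited Markdown table. The column names `Agent` and `Avg. norm. score`, the `SAMPLE_README` excerpt, and all function names are hypothetical. Only the agent names and scores in the sample are taken from the table above.

```python
# Hypothetical excerpt; the real README layout and column names may differ.
SAMPLE_README = """
# Leaderboard (hypothetical excerpt)

| Agent | Avg. norm. score |
|---|---|
| One-Shot Devstral | 0.02 |
| Greedy gpt-oss-120b | 0.40 |
| Greedy gpt-oss-20b | 0.40 |
| MLGym CWM | 0.30 |
"""


def split_row(line):
    """Split one pipe-delimited table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def is_separator(cells):
    """True for the |---|---| line that follows a Markdown table header."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(text):
    """Return the pipe-delimited table rows in `text` as a list of dicts."""
    table_lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if not table_lines:
        return []
    header = split_row(table_lines[0])
    rows = []
    for line in table_lines[1:]:
        cells = split_row(line)
        # Skip the header separator and any malformed row.
        if is_separator(cells) or len(cells) != len(header):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def to_leaderboard(rows, name_col="Agent", score_col="Avg. norm. score"):
    """Keep display names as-is, sort by score (descending), assign ranks."""
    scored = [{"subject": r[name_col], "score": float(r[score_col])} for r in rows]
    # sorted() is stable, so tied scores keep their source order.
    ranked = sorted(scored, key=lambda r: r["score"], reverse=True)
    for rank, row in enumerate(ranked, start=1):
        row["rank"] = rank
    return ranked


leaderboard = to_leaderboard(parse_markdown_table(SAMPLE_README))
for row in leaderboard:
    print(row["rank"], row["subject"], f'{row["score"]:.2f}')
```

With a stable sort, entries that share a score (such as the two 0.40 rows) receive consecutive ranks in the order they appear in the source. That is consistent with the table above, though the page does not state how ties are actually resolved.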