RealDataAgentBench

Data-science agent benchmark evaluating whether LLM agents solve real-data analysis tasks correctly and robustly across correctness, code quality, efficiency, and statistical validity.

12 rows
Primary metric: rdab_score
Sampled: 2026-04-28

Metadata

Metrics

RDAB Score, RDAB Score Std (lower is better), 95% CI Lower, 95% CI Upper, Avg Cost (lower is better), Total Cost (lower is better), Tasks Run, Total Runs
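
As a rough illustration of how the summary metrics listed above could be derived from per-run results, here is a minimal sketch. It is not the benchmark's actual implementation: the `Run` record, the `summarize` function, the normal-approximation 95% interval, and the choice to average cost per run are all assumptions, and the example values are made up for demonstration only.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run (hypothetical record layout, not from the benchmark)."""
    task_id: str     # identifier of the task that was attempted
    score: float     # per-run score, assumed to lie in [0, 1]
    cost_usd: float  # cost of this single run


def summarize(runs):
    """Aggregate per-run results into the metric names shown in the list above."""
    scores = [r.score for r in runs]
    costs = [r.cost_usd for r in runs]
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation; undefined for a single run, so fall back to 0.
    std = statistics.stdev(scores) if n > 1 else 0.0
    # Assumption: normal-approximation 95% interval around the mean.
    half_width = 1.96 * std / math.sqrt(n)
    return {
        "rdab_score": mean,
        "rdab_score_std": std,
        "ci95_lower": mean - half_width,
        "ci95_upper": mean + half_width,
        "avg_cost": statistics.fmean(costs),  # assumption: averaged per run
        "total_cost": math.fsum(costs),
        "tasks_run": len({r.task_id for r in runs}),
        "total_runs": n,
    }


# Made-up example values, not benchmark data.
example_runs = [
    Run("task-a", 0.9, 0.02),
    Run("task-a", 0.8, 0.03),
    Run("task-b", 0.7, 0.01),
    Run("task-b", 1.0, 0.04),
]
summary = summarize(example_runs)
```

Under these assumptions, a wide gap between the CI bounds or a large standard deviation signals that a model's average score is less stable across runs, which is why the standard deviation is marked "lower is better".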

Latest Results

Rows ranked by highest average RDAB Score.

Rank  Subject                    RDAB Score  Model Match                                      Provenance  Sampled
1     gpt-4.1                    0.88        GPT-4.1 (openai-gpt-4.1)                         Imported    2026-04-28
2     gpt-4.1-mini               0.87        GPT-4.1 Mini (openai-gpt-4.1-mini)               Imported    2026-04-28
3     claude-sonnet-4-6          0.86        Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)  Imported    2026-04-28
4     gpt-4o                     0.85        GPT-4o (openai-gpt-4o)                           Imported    2026-04-28
5     claude-opus-4-6            0.85        Claude Opus 4.6 (anthropic-claude-opus-4.6)      Imported    2026-04-28
6     grok-3-mini                0.83        Grok 3 Mini (x-ai-grok-3-mini)                   Imported    2026-04-28
7     claude-haiku-4-5-20251001  0.80        Claude Haiku 4.5 (anthropic-claude-haiku-4.5)    Imported    2026-04-28
8     llama-3.3-70b-versatile    0.80        -                                                Imported    2026-04-28
9     gpt-4o-mini                0.78        GPT-4o-mini (openai-gpt-4o-mini)                 Imported    2026-04-28
10    gpt-5                      0.78        GPT-5 (openai-gpt-5)                             Imported    2026-04-28
11    gemini-2.5-flash           0.66        Gemini 2.5 Flash (google-gemini-2.5-flash)       Imported    2026-04-28
12    gpt-4.1-nano               0.62        GPT-4.1 Nano (openai-gpt-4.1-nano)               Imported    2026-04-28