RealDataAgentBench
A data-science agent benchmark that evaluates whether LLM agents solve real-data analysis tasks correctly and robustly, scoring them on correctness, code quality, efficiency, and statistical validity.
Rows: 12
Primary metric: rdab_score
Sampled: 2026-04-28
Metadata
Metrics
RDAB Score, RDAB Score Std (lower is better), 95% CI Lower, 95% CI Upper, Avg Cost (lower is better), Total Cost (lower is better), Tasks Run, Total Runs
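The metric list pairs a score with a standard deviation and a 95% confidence interval. The page does not say how these are derived, so the following is only a minimal sketch under an assumption: that each subject has several per-run scores and that the interval is a normal-approximation interval around the mean. The function name `summarize_runs` and the input values in the usage note are hypothetical and are not taken from the benchmark.

```python
import math
import statistics


def summarize_runs(scores, z=1.96):
    """Summarize per-run scores as mean, sample std, and an approximate 95% CI.

    Assumption: a normal-approximation interval (mean +/- z * std / sqrt(n)).
    The benchmark may use a different method, such as bootstrapping.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    mean = statistics.fmean(scores)
    # Sample standard deviation is undefined for a single run; report 0.0.
    std = statistics.stdev(scores) if n > 1 else 0.0
    half_width = z * std / math.sqrt(n)
    return {
        "mean": mean,
        "std": std,
        "ci_lower": mean - half_width,
        "ci_upper": mean + half_width,
    }
```

For example, `summarize_runs([0.8, 0.9, 1.0])` (made-up inputs) gives a mean of about 0.9 and a sample standard deviation of about 0.1. A lower standard deviation and a narrower interval indicate more consistent runs, which matches the "lower is better" note on RDAB Score Std.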
| Rank | Subject | RDAB Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4.1 | 0.88 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-04-28 |
| 2 | gpt-4.1-mini | 0.87 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-04-28 |
| 3 | claude-sonnet-4-6 | 0.86 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-04-28 |
| 4 | gpt-4o | 0.85 | GPT-4o (openai-gpt-4o) | Imported | 2026-04-28 |
| 5 | claude-opus-4-6 | 0.85 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-04-28 |
| 6 | grok-3-mini | 0.83 | Grok 3 Mini (x-ai-grok-3-mini) | Imported | 2026-04-28 |
| 7 | claude-haiku-4-5-20251001 | 0.80 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-04-28 |
| 8 | llama-3.3-70b-versatile | 0.80 | — | Imported | 2026-04-28 |
| 9 | gpt-4o-mini | 0.78 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-04-28 |
| 10 | gpt-5 | 0.78 | GPT-5 (openai-gpt-5) | Imported | 2026-04-28 |
| 11 | gemini-2.5-flash | 0.66 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-04-28 |
| 12 | gpt-4.1-nano | 0.62 | GPT-4.1 Nano (openai-gpt-4.1-nano) | Imported | 2026-04-28 |
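
Several rows in the table above show the same two-decimal score but different ranks (0.85, 0.80, and 0.78 each appear twice). The displayed ranks presumably come from unrounded scores that are not shown here. As a rough sketch, the snippet below re-ranks the displayed scores with standard competition ranking, where equal scores share a rank, to make clear which orderings cannot be confirmed from the table alone. The names `ROWS` and `competition_ranks` are illustrative and not part of the benchmark.

```python
# (subject, displayed RDAB score) pairs copied from the table above.
ROWS = [
    ("gpt-4.1", 0.88),
    ("gpt-4.1-mini", 0.87),
    ("claude-sonnet-4-6", 0.86),
    ("gpt-4o", 0.85),
    ("claude-opus-4-6", 0.85),
    ("grok-3-mini", 0.83),
    ("claude-haiku-4-5-20251001", 0.80),
    ("llama-3.3-70b-versatile", 0.80),
    ("gpt-4o-mini", 0.78),
    ("gpt-5", 0.78),
    ("gemini-2.5-flash", 0.66),
    ("gpt-4.1-nano", 0.62),
]


def competition_ranks(rows):
    """Rank rows by score, descending; equal scores share the same rank."""
    ranked = []
    prev_score = None
    prev_rank = 0
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda row: -row[1])
    for position, (subject, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, subject, score))
        prev_score, prev_rank = score, rank
    return ranked


if __name__ == "__main__":
    for rank, subject, score in competition_ranks(ROWS):
        print(f"{rank:>2}  {subject:<28} {score:.2f}")
```

Under this ranking, gpt-4o and claude-opus-4-6 share rank 4, claude-haiku-4-5-20251001 and llama-3.3-70b-versatile share rank 7, and gpt-4o-mini and gpt-5 share rank 9.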