ScienceAgentBench (HAL)
HAL's standardized, cost-aware agent leaderboard for ScienceAgentBench scientific agent tasks.
23 rows
Primary metric: accuracy
Sampled: 2026-05-27
Metadata
Metrics
Accuracy, Cost (USD) (lower is better), Runs
| Rank | Subject | Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | SAB Self-Debug / o3 Medium (April 2025) | 33.33 | — | Verified | 2026-05-27 |
| 2 | SAB Self-Debug / Claude Sonnet 4.5 High (September 2025) | 30.39 | — | Verified | 2026-05-27 |
| 3 | SAB Self-Debug / Claude-3.7 Sonnet High (February 2025) | 30.39 | — | Verified | 2026-05-27 |
| 4 | SAB Self-Debug / GPT-5 Medium (August 2025) | 30.39 | — | Verified | 2026-05-27 |
| 5 | SAB Self-Debug / Claude Sonnet 4.5 (September 2025) | 29.41 | — | Verified | 2026-05-27 |
| 6 | SAB Self-Debug / o4-mini Low (April 2025) | 27.45 | — | Verified | 2026-05-27 |
| 7 | SAB Self-Debug / o4-mini High (April 2025) | 27.45 | — | Verified | 2026-05-27 |
| 8 | SAB Self-Debug / Claude Opus 4.1 (August 2025) | 27.45 | — | Verified | 2026-05-27 |
| 9 | SAB Self-Debug / Claude Opus 4.1 High (August 2025) | 26.47 | — | Verified | 2026-05-27 |
| 10 | SAB Self-Debug / GPT-4.1 (April 2025) | 24.51 | — | Verified | 2026-05-27 |
| 11 | SAB Self-Debug / Claude Haiku 4.5 High (October 2025) | 23.53 | — | Verified | 2026-05-27 |
| 12 | SAB Self-Debug / DeepSeek R1 (January 2025) | 23.53 | — | Verified | 2026-05-27 |
| 13 | SAB Self-Debug / Claude-3.7 Sonnet (February 2025) | 22.55 | — | Verified | 2026-05-27 |
| 14 | HAL Generalist Agent / o4-mini High (April 2025) | 21.57 | — | Verified | 2026-05-27 |
| 15 | HAL Generalist Agent / o4-mini Low (April 2025) | 19.61 | — | Verified | 2026-05-27 |
| 16 | SAB Self-Debug / Claude Haiku 4.5 (October 2025) | 18.63 | — | Verified | 2026-05-27 |
| 17 | HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) | 17.65 | — | Verified | 2026-05-27 |
| 18 | SAB Self-Debug / DeepSeek V3 (March 2025) | 15.69 | — | Verified | 2026-05-27 |
| 19 | SAB Self-Debug / Gemini 2.0 Flash (February 2025) | 12.75 | — | Verified | 2026-05-27 |
| 20 | HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) | 10.78 | — | Verified | 2026-05-27 |
| 21 | HAL Generalist Agent / o3 Medium (April 2025) | 9.80 | — | Verified | 2026-05-27 |
| 22 | HAL Generalist Agent / GPT-4.1 (April 2025) | 6.86 | — | Verified | 2026-05-27 |
| 23 | HAL Generalist Agent / DeepSeek V3 (March 2025) | 0.98 | — | Verified | 2026-05-27 |
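
The table gives tied rows consecutive ranks (for example, ranks 2 to 4 all report 30.39), and each Subject combines an agent scaffold with a model. The sketch below is one possible way to re-read the table, not part of HAL's tooling. It recomputes tie-aware ranks and finds the top entry per scaffold. The helper names `split_subject`, `competition_ranks`, and `best_per_scaffold` are hypothetical, and the " / " separator is assumed from the Subject column as shown.

```python
# (subject, accuracy in percent), copied from the table above.
ROWS = [
    ("SAB Self-Debug / o3 Medium (April 2025)", 33.33),
    ("SAB Self-Debug / Claude Sonnet 4.5 High (September 2025)", 30.39),
    ("SAB Self-Debug / Claude-3.7 Sonnet High (February 2025)", 30.39),
    ("SAB Self-Debug / GPT-5 Medium (August 2025)", 30.39),
    ("SAB Self-Debug / Claude Sonnet 4.5 (September 2025)", 29.41),
    ("SAB Self-Debug / o4-mini Low (April 2025)", 27.45),
    ("SAB Self-Debug / o4-mini High (April 2025)", 27.45),
    ("SAB Self-Debug / Claude Opus 4.1 (August 2025)", 27.45),
    ("SAB Self-Debug / Claude Opus 4.1 High (August 2025)", 26.47),
    ("SAB Self-Debug / GPT-4.1 (April 2025)", 24.51),
    ("SAB Self-Debug / Claude Haiku 4.5 High (October 2025)", 23.53),
    ("SAB Self-Debug / DeepSeek R1 (January 2025)", 23.53),
    ("SAB Self-Debug / Claude-3.7 Sonnet (February 2025)", 22.55),
    ("HAL Generalist Agent / o4-mini High (April 2025)", 21.57),
    ("HAL Generalist Agent / o4-mini Low (April 2025)", 19.61),
    ("SAB Self-Debug / Claude Haiku 4.5 (October 2025)", 18.63),
    ("HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025)", 17.65),
    ("SAB Self-Debug / DeepSeek V3 (March 2025)", 15.69),
    ("SAB Self-Debug / Gemini 2.0 Flash (February 2025)", 12.75),
    ("HAL Generalist Agent / Claude-3.7 Sonnet (February 2025)", 10.78),
    ("HAL Generalist Agent / o3 Medium (April 2025)", 9.8),
    ("HAL Generalist Agent / GPT-4.1 (April 2025)", 6.86),
    ("HAL Generalist Agent / DeepSeek V3 (March 2025)", 0.98),
]


def split_subject(subject):
    """Split 'scaffold / model' on the first ' / ' separator (assumed format)."""
    scaffold, model = subject.split(" / ", 1)
    return scaffold, model


def competition_ranks(scores):
    """Rank scores already sorted in descending order; ties share the lowest rank."""
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # same score as the previous row: reuse its rank
        else:
            ranks.append(i + 1)      # new score: rank is the 1-based position
    return ranks


def best_per_scaffold(rows):
    """Map each scaffold to its highest-accuracy (model, accuracy) pair."""
    best = {}
    for subject, accuracy in rows:
        scaffold, model = split_subject(subject)
        if scaffold not in best or accuracy > best[scaffold][1]:
            best[scaffold] = (model, accuracy)
    return best


ranks = competition_ranks([accuracy for _, accuracy in ROWS])
best = best_per_scaffold(ROWS)

for (subject, accuracy), rank in zip(ROWS, ranks):
    print(f"{rank:>2}  {accuracy:6.2f}  {subject}")
```

Under this tie-aware scheme the three entries at 30.39 share rank 2 and the next entry is rank 5. Whether HAL breaks ties by another criterion, such as cost, is not stated in the table, so the sequential ranks above may still be intentional.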