ScienceAgentBench (HAL)

HAL's standardized, cost-aware agent leaderboard for ScienceAgentBench scientific agent tasks.

Rows: 23
Primary metric: accuracy
Sampled: 2026-05-27

Metadata

Metrics

Accuracy (%, higher is better), Cost (USD, lower is better), Runs

Latest Results

Rows are parsed from the public HAL static leaderboard table. Each Subject preserves the source's scaffold and model display names, written as "scaffold / model"; the score is the table's Accuracy percentage.

Rank Subject Accuracy Model Match Provenance Sampled
1 SAB Self-Debug / o3 Medium (April 2025) 33.33 Verified 2026-05-27
2 SAB Self-Debug / Claude Sonnet 4.5 High (September 2025) 30.39 Verified 2026-05-27
3 SAB Self-Debug / Claude-3.7 Sonnet High (February 2025) 30.39 Verified 2026-05-27
4 SAB Self-Debug / GPT-5 Medium (August 2025) 30.39 Verified 2026-05-27
5 SAB Self-Debug / Claude Sonnet 4.5 (September 2025) 29.41 Verified 2026-05-27
6 SAB Self-Debug / o4-mini Low (April 2025) 27.45 Verified 2026-05-27
7 SAB Self-Debug / o4-mini High (April 2025) 27.45 Verified 2026-05-27
8 SAB Self-Debug / Claude Opus 4.1 (August 2025) 27.45 Verified 2026-05-27
9 SAB Self-Debug / Claude Opus 4.1 High (August 2025) 26.47 Verified 2026-05-27
10 SAB Self-Debug / GPT-4.1 (April 2025) 24.51 Verified 2026-05-27
11 SAB Self-Debug / Claude Haiku 4.5 High (October 2025) 23.53 Verified 2026-05-27
12 SAB Self-Debug / DeepSeek R1 (January 2025) 23.53 Verified 2026-05-27
13 SAB Self-Debug / Claude-3.7 Sonnet (February 2025) 22.55 Verified 2026-05-27
14 HAL Generalist Agent / o4-mini High (April 2025) 21.57 Verified 2026-05-27
15 HAL Generalist Agent / o4-mini Low (April 2025) 19.61 Verified 2026-05-27
16 SAB Self-Debug / Claude Haiku 4.5 (October 2025) 18.63 Verified 2026-05-27
17 HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) 17.65 Verified 2026-05-27
18 SAB Self-Debug / DeepSeek V3 (March 2025) 15.69 Verified 2026-05-27
19 SAB Self-Debug / Gemini 2.0 Flash (February 2025) 12.75 Verified 2026-05-27
20 HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) 10.78 Verified 2026-05-27
21 HAL Generalist Agent / o3 Medium (April 2025) 9.80 Verified 2026-05-27
22 HAL Generalist Agent / GPT-4.1 (April 2025) 6.86 Verified 2026-05-27
23 HAL Generalist Agent / DeepSeek V3 (March 2025) 0.98 Verified 2026-05-27
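
The rows above follow a regular enough layout that they can be split back into fields. The sketch below is one possible way to do that in Python, not the parser HAL or this page actually uses. It assumes each row is whitespace-flattened exactly as shown: rank, "scaffold / model", accuracy, a single status word, and an ISO date. The names `Row`, `ROW_RE`, and `parse_row` are hypothetical, and the field name `status` is a guess, since the header lists both "Model Match" and "Provenance" while each row carries only one such value.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One leaderboard row (hypothetical structure for illustration)."""
    rank: int
    scaffold: str
    model: str
    accuracy: float  # percentage, as printed in the table
    status: str      # e.g. "Verified"; which header column this is remains an assumption
    sampled: str     # ISO date string, e.g. "2026-05-27"


# Assumed layout: "<rank> <scaffold> / <model> <accuracy> <status> <YYYY-MM-DD>".
# The trailing anchors (status word + date at end of line) keep version numbers
# inside model names, such as "4.5" or "2.0", from being read as the accuracy.
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+"
    r"(?P<scaffold>.+?)\s+/\s+"
    r"(?P<model>.+?)\s+"
    r"(?P<accuracy>\d+(?:\.\d+)?)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Optional[Row]:
    """Return a Row for a data line, or None for headers and other text."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    return Row(
        rank=int(match["rank"]),
        scaffold=match["scaffold"],
        model=match["model"],
        accuracy=float(match["accuracy"]),
        status=match["status"],
        sampled=match["sampled"],
    )


if __name__ == "__main__":
    sample = [
        "Rank Subject Accuracy Model Match Provenance Sampled",
        "1 SAB Self-Debug / o3 Medium (April 2025) 33.33 Verified 2026-05-27",
        "2 SAB Self-Debug / Claude Sonnet 4.5 High (September 2025) 30.39 Verified 2026-05-27",
    ]
    for row in filter(None, map(parse_row, sample)):
        print(row.rank, row.scaffold, "|", row.model, "|", row.accuracy)
```

Anchoring the pattern on the trailing status word and date is a deliberate choice here: model display names contain digits and decimals of their own, so matching the accuracy purely by position from the left would be fragile.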