CORE-Bench Hard
HAL's cost-aware agent leaderboard for CORE-Bench Hard scientific programming tasks.
Rows: 49
Primary metric: accuracy
Sampled: 2026-05-27
Metadata
Metrics
Accuracy, Cost (USD, lower is better), Runs
| Rank | Subject | Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Code / Claude Opus 4.5 | 77.78 | — | Verified | 2026-05-27 |
| 2 | Claude Code / Claude Sonnet 4.5 (September 2025) | 62.22 | — | Verified | 2026-05-27 |
| 3 | CORE-Agent / Claude Opus 4.1 (August 2025) | 51.11 | — | Verified | 2026-05-27 |
| 4 | Claude Code / Claude Sonnet 4 (May 2025) | 46.67 | — | Verified | 2026-05-27 |
| 5 | CORE-Agent / Claude Sonnet 4.5 High (September 2025) | 44.44 | — | Verified | 2026-05-27 |
| 6 | CORE-Agent / Claude Opus 4.5 High (November 2025) | 42.22 | — | Verified | 2026-05-27 |
| 7 | CORE-Agent / Claude Opus 4.5 (November 2025) | 42.22 | — | Verified | 2026-05-27 |
| 8 | Claude Code / Claude Opus 4.1 | 42.22 | — | Verified | 2026-05-27 |
| 9 | CORE-Agent / Claude Opus 4.1 High (August 2025) | 42.22 | — | Verified | 2026-05-27 |
| 10 | CORE-Agent / Gemini 3 Pro Preview High (November 2025) | 40.00 | — | Verified | 2026-05-27 |
| 11 | HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) | 37.78 | — | Verified | 2026-05-27 |
| 12 | CORE-Agent / Claude Sonnet 4.5 (September 2025) | 37.78 | — | Verified | 2026-05-27 |
| 13 | HAL Generalist Agent / o4-mini High (April 2025) | 35.56 | — | Verified | 2026-05-27 |
| 14 | CORE-Agent / Claude-3.7 Sonnet (February 2025) | 35.56 | — | Verified | 2026-05-27 |
| 15 | HAL Generalist Agent / Gemini 3 Pro Preview High (November 2025) | 35.56 | — | Verified | 2026-05-27 |
| 16 | HAL Generalist Agent / Claude Opus 4.1 (August 2025) | 35.56 | — | Verified | 2026-05-27 |
| 17 | HAL Generalist Agent / Claude Sonnet 4.5 (September 2025) | 33.33 | — | Verified | 2026-05-27 |
| 18 | CORE-Agent / Claude Sonnet 4 High (May 2025) | 33.33 | — | Verified | 2026-05-27 |
| 19 | CORE-Agent / GPT-4.1 (April 2025) | 33.33 | — | Verified | 2026-05-27 |
| 20 | HAL Generalist Agent / Claude Opus 4.5 (November 2025) | 33.33 | — | Verified | 2026-05-27 |
| 21 | HAL Generalist Agent / Claude Opus 4.1 High (August 2025) | 33.33 | — | Verified | 2026-05-27 |
| 22 | HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) | 31.11 | — | Verified | 2026-05-27 |
| 23 | HAL Generalist Agent / Claude Opus 4.5 High (November 2025) | 31.11 | — | Verified | 2026-05-27 |
| 24 | CORE-Agent / Claude Sonnet 4 (May 2025) | 28.89 | — | Verified | 2026-05-27 |
| 25 | HAL Generalist Agent / Claude Sonnet 4.5 High (September 2025) | 28.89 | — | Verified | 2026-05-27 |
| 26 | CORE-Agent / GPT-5 Medium (August 2025) | 26.67 | — | Verified | 2026-05-27 |
| 27 | CORE-Agent / o4-mini High (April 2025) | 26.67 | — | Verified | 2026-05-27 |
| 28 | CORE-Agent / Claude-3.7 Sonnet High (February 2025) | 24.44 | — | Verified | 2026-05-27 |
| 29 | CORE-Agent / o3 Medium (April 2025) | 24.44 | — | Verified | 2026-05-27 |
| 30 | HAL Generalist Agent / GPT-4.1 (April 2025) | 22.22 | — | Verified | 2026-05-27 |
| 31 | HAL Generalist Agent / o3 Medium (April 2025) | 22.22 | — | Verified | 2026-05-27 |
| 32 | CORE-Agent / Gemini 2.5 Pro Preview (March 2025) | 22.22 | — | Verified | 2026-05-27 |
| 33 | CORE-Agent / DeepSeek V3.1 (August 2025) | 20.00 | — | Verified | 2026-05-27 |
| 34 | CORE-Agent / DeepSeek V3 (March 2025) | 17.78 | — | Verified | 2026-05-27 |
| 35 | CORE-Agent / o4-mini Low (April 2025) | 17.78 | — | Verified | 2026-05-27 |
| 36 | HAL Generalist Agent / o4-mini Low (April 2025) | 15.56 | — | Verified | 2026-05-27 |
| 37 | CORE-Agent / GPT-OSS-120B (August 2025) | 11.11 | — | Verified | 2026-05-27 |
| 38 | CORE-Agent / GPT-OSS-120B High (August 2025) | 11.11 | — | Verified | 2026-05-27 |
| 39 | CORE-Agent / Gemini 2.0 Flash (February 2025) | 11.11 | — | Verified | 2026-05-27 |
| 40 | HAL Generalist Agent / GPT-5 Medium (August 2025) | 11.11 | — | Verified | 2026-05-27 |
| 41 | CORE-Agent / Claude Haiku 4.5 (October 2025) | 11.11 | — | Verified | 2026-05-27 |
| 42 | HAL Generalist Agent / GPT-OSS-120B High (August 2025) | 8.89 | — | Verified | 2026-05-27 |
| 43 | HAL Generalist Agent / GPT-OSS-120B (August 2025) | 8.89 | — | Verified | 2026-05-27 |
| 44 | HAL Generalist Agent / DeepSeek V3 (March 2025) | 8.89 | — | Verified | 2026-05-27 |
| 45 | HAL Generalist Agent / DeepSeek R1 (May 2025) | 8.89 | — | Verified | 2026-05-27 |
| 46 | CORE-Agent / DeepSeek R1 (January 2025) | 6.67 | — | Verified | 2026-05-27 |
| 47 | HAL Generalist Agent / DeepSeek R1 (January 2025) | 4.45 | — | Verified | 2026-05-27 |
| 48 | HAL Generalist Agent / Gemini 2.0 Flash (February 2025) | 4.44 | — | Verified | 2026-05-27 |
| 49 | HAL Generalist Agent / Gemini 2.5 Pro Preview (March 2025) | 4.44 | — | Verified | 2026-05-27 |