GAIA (HAL)

HAL's standardized, cost-aware agent leaderboard for GAIA web assistance tasks.

Rows: 32
Primary metric: accuracy
Sampled: 2026-05-27

Metadata

Metrics

Accuracy, Level 1, Level 2, Level 3, Cost (USD, lower is better), Runs

Latest Results

Rows are parsed from the public HAL static leaderboard table. Scaffold and model display names are preserved as they appear in the source; each score is that table's Accuracy percentage.

Rank Subject Accuracy Model Match Provenance Sampled
1 HAL Generalist Agent / Claude Sonnet 4.5 (September 2025) 74.55 Verified 2026-05-27
2 HAL Generalist Agent / Claude Sonnet 4.5 High (September 2025) 70.91 Verified 2026-05-27
3 HAL Generalist Agent / Claude Opus 4.1 High (August 2025) 68.48 Verified 2026-05-27
4 HAL Generalist Agent / Claude Opus 4 High (May 2025) 64.85 Verified 2026-05-27
5 HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) 64.24 Verified 2026-05-27
6 HAL Generalist Agent / Claude Opus 4.1 (August 2025) 64.24 Verified 2026-05-27
7 HF Open Deep Research / GPT-5 Medium (August 2025) 62.8 Verified 2026-05-27
8 HAL Generalist Agent / GPT-5 Medium (August 2025) 59.39 Verified 2026-05-27
9 HAL Generalist Agent / o4-mini Low (April 2025) 58.18 Verified 2026-05-27
10 HF Open Deep Research / Claude Opus 4 (May 2025) 57.58 Verified 2026-05-27
11 HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) 56.36 Verified 2026-05-27
12 HAL Generalist Agent / Claude Haiku 4.5 (October 2025) 56.36 Verified 2026-05-27
13 HF Open Deep Research / o4-mini High (April 2025) 55.76 Verified 2026-05-27
14 HAL Generalist Agent / o4-mini High (April 2025) 54.55 Verified 2026-05-27
15 HF Open Deep Research / GPT-4.1 (April 2025) 50.3 Verified 2026-05-27
16 HAL Generalist Agent / GPT-4.1 (April 2025) 49.7 Verified 2026-05-27
17 HF Open Deep Research / o4-mini Low (April 2025) 47.88 Verified 2026-05-27
18 HF Open Deep Research / Claude-3.7 Sonnet (February 2025) 36.97 Verified 2026-05-27
19 HF Open Deep Research / Claude-3.7 Sonnet High (February 2025) 35.76 Verified 2026-05-27
20 HAL Generalist Agent / Gemini 2.0 Flash (February 2025) 32.73 Verified 2026-05-27
21 HF Open Deep Research / o3 Medium (April 2025) 32.73 Verified 2026-05-27
22 HF Open Deep Research / Claude Sonnet 4.5 (September 2025) 30.91 Verified 2026-05-27
23 HF Open Deep Research / Claude Sonnet 4.5 High (September 2025) 30.91 Verified 2026-05-27
24 HAL Generalist Agent / DeepSeek R1 (January 2025) 30.3 Verified 2026-05-27
25 HAL Generalist Agent / Claude Opus 4 (May 2025) 30.3 Verified 2026-05-27
26 HAL Generalist Agent / DeepSeek V3 (March 2025) 29.39 Verified 2026-05-27
27 HF Open Deep Research / DeepSeek V3 (March 2025) 28.48 Verified 2026-05-27
28 HF Open Deep Research / Claude Opus 4.1 (August 2025) 28.48 Verified 2026-05-27
29 HAL Generalist Agent / o3 Medium (April 2025) 28.48 Verified 2026-05-27
30 HF Open Deep Research / Claude Opus 4.1 High (August 2025) 25.45 Verified 2026-05-27
31 HF Open Deep Research / DeepSeek R1 (January 2025) 24.85 Verified 2026-05-27
32 HF Open Deep Research / Gemini 2.0 Flash (February 2025) 19.39 Verified 2026-05-27
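
The rows above follow a regular shape, so they can be split back into fields for analysis. The sketch below is one way to do that in Python, assuming each row reads `<rank> <scaffold> / <model> <accuracy> <status> <date>`. The names `ROW_RE`, `LeaderboardRow`, and `parse_row` are hypothetical, as is the field name `status`: the header lists both Model Match and Provenance, but each row carries only one such token ("Verified"), and the text does not say which column it belongs to.

```python
import re
from dataclasses import dataclass

# Assumed row layout (not documented by the source):
#   <rank> <scaffold> / <model> <accuracy> <status> <YYYY-MM-DD>
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+"
    r"(?P<scaffold>.+?)\s+/\s+"
    r"(?P<model>.+?)\s+"
    r"(?P<accuracy>\d+(?:\.\d+)?)\s+"
    r"(?P<status>\w+)\s+"
    r"(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


@dataclass(frozen=True)
class LeaderboardRow:
    rank: int
    scaffold: str   # agent scaffold, e.g. "HAL Generalist Agent"
    model: str      # model display name, kept exactly as in the source
    accuracy: float # Accuracy percentage
    status: str     # hypothetical name for the single "Verified" token
    sampled: str    # sampling date, kept as an ISO-style string


def parse_row(line: str) -> LeaderboardRow:
    """Split one flattened leaderboard row into typed fields."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return LeaderboardRow(
        rank=int(match["rank"]),
        scaffold=match["scaffold"],
        model=match["model"],
        accuracy=float(match["accuracy"]),
        status=match["status"],
        sampled=match["sampled"],
    )


if __name__ == "__main__":
    row = parse_row(
        "7 HF Open Deep Research / GPT-5 Medium (August 2025) "
        "62.8 Verified 2026-05-27"
    )
    print(row.scaffold, "|", row.model, "|", row.accuracy)
```

Anchoring the pattern on the trailing date and status token keeps version numbers inside model names (for example "4.5" or "2.0") from being mistaken for the accuracy value.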