TAU-bench Airline (HAL)

HAL's standardized, cost-aware agent leaderboard for TAU-bench Airline customer-service tasks.

Rows: 26
Primary metric: accuracy
Sampled: 2026-05-27

Metadata

Metrics

Accuracy, Cost (USD) (lower is better), Runs

Latest Results

Rows are parsed from the public HAL static leaderboard table. Each Subject is listed as "scaffold / model", with the source display names preserved as published; the score is that table's Accuracy percentage.

Rank Subject Accuracy (%) Model Match Provenance Sampled
1 TAU-bench Tool Calling / o4-mini High (April 2025) 56 Verified 2026-05-27
2 HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) 56 Verified 2026-05-27
3 TAU-bench Tool Calling / o3 Medium (April 2025) 54 Verified 2026-05-27
4 HAL Generalist Agent / Claude Opus 4.1 (August 2025) 54 Verified 2026-05-27
5 TAU-bench Tool Calling / Claude-3.7 Sonnet High (February 2025) 52 Verified 2026-05-27
6 TAU-bench Tool Calling / Claude Opus 4.1 High (August 2025) 52 Verified 2026-05-27
7 TAU-bench Tool Calling / Claude Opus 4.1 (August 2025) 50 Verified 2026-05-27
8 TAU-bench Tool Calling / GPT-5 Medium (August 2025) 48 Verified 2026-05-27
9 TAU-bench Tool Calling / DeepSeek V3 (March 2025) 44 Verified 2026-05-27
10 TAU-bench Tool Calling / Claude-3.7 Sonnet (February 2025) 44 Verified 2026-05-27
11 HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) 44 Verified 2026-05-27
12 HAL Generalist Agent / Claude Opus 4 (May 2025) 44 Verified 2026-05-27
13 HAL Generalist Agent / Claude Opus 4 High (May 2025) 44 Verified 2026-05-27
14 TAU-bench Tool Calling / o4-mini Low (April 2025) 36 Verified 2026-05-27
15 TAU-bench Tool Calling / GPT-4.1 (April 2025) 36 Verified 2026-05-27
16 TAU-bench Tool Calling / DeepSeek R1 (January 2025) 36 Verified 2026-05-27
17 HAL Generalist Agent / Claude Opus 4.1 High (August 2025) 32 Verified 2026-05-27
18 HAL Generalist Agent / GPT-5 Medium (August 2025) 30 Verified 2026-05-27
19 TAU-bench Tool Calling / Gemini 2.0 Flash High (February 2025) 28 Verified 2026-05-27
20 HAL Generalist Agent / Gemini 2.0 Flash (February 2025) 22 Verified 2026-05-27
21 HAL Generalist Agent / o4-mini Low (April 2025) 22 Verified 2026-05-27
22 HAL Generalist Agent / o3 Medium (April 2025) 20 Verified 2026-05-27
23 HAL Generalist Agent / DeepSeek V3 (March 2025) 18 Verified 2026-05-27
24 HAL Generalist Agent / o4-mini High (April 2025) 18 Verified 2026-05-27
25 HAL Generalist Agent / GPT-4.1 (April 2025) 16 Verified 2026-05-27
26 HAL Generalist Agent / DeepSeek R1 (January 2025) 10 Verified 2026-05-27
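
The rows above follow a regular "rank, scaffold / model, accuracy, status, date" layout, so they can be split back into fields for comparison. The following is a minimal sketch under stated assumptions, not HAL's own parser: the names `Row`, `ROW_RE`, `parse_row`, and `by_model` are hypothetical, and the single "Verified" cell is stored in a generic `status` field because the listing does not show whether it belongs to Model Match or Provenance.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    scaffold: str
    model: str
    accuracy: float  # percentage, as listed in the table
    status: str      # the lone "Verified" cell; its column is an assumption
    sampled: str     # ISO date string


# Hypothetical pattern: rank, "scaffold / model", accuracy, status, date.
# The trailing date anchor keeps digits inside model names (e.g. "2.0")
# from being mistaken for the accuracy value.
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+(?P<scaffold>.+?)\s+/\s+(?P<model>.+?)\s+"
    r"(?P<accuracy>\d+(?:\.\d+)?)\s+(?P<status>\w+)\s+"
    r"(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Optional[Row]:
    """Return a Row for a data line, or None for headers and other text."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return Row(
        rank=int(m["rank"]),
        scaffold=m["scaffold"],
        model=m["model"],
        accuracy=float(m["accuracy"]),
        status=m["status"],
        sampled=m["sampled"],
    )


# Three rows copied from the listing above.
SAMPLE = [
    "1 TAU-bench Tool Calling / o4-mini High (April 2025) 56 Verified 2026-05-27",
    "19 TAU-bench Tool Calling / Gemini 2.0 Flash High (February 2025) 28 Verified 2026-05-27",
    "24 HAL Generalist Agent / o4-mini High (April 2025) 18 Verified 2026-05-27",
]

rows = [r for r in map(parse_row, SAMPLE) if r is not None]

# Group accuracy by model so the same model can be compared across scaffolds.
by_model: dict = {}
for r in rows:
    by_model.setdefault(r.model, {})[r.scaffold] = r.accuracy

for model, scores in by_model.items():
    print(model, scores)
```

Grouping by model rather than by rank makes scaffold differences easier to see: in the listing, the same model entry can sit far apart depending on whether it ran under TAU-bench Tool Calling or HAL Generalist Agent.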