TAU-bench Airline (HAL)
HAL's standardized, cost-aware agent leaderboard for TAU-bench Airline customer-service tasks.
Rows: 26
Primary metric: accuracy
Sampled: 2026-05-27
Metadata
Metrics
Accuracy, Cost (USD) (lower is better), Runs
| Rank | Subject | Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | TAU-bench Tool Calling / o4-mini High (April 2025) | 56 | — | Verified | 2026-05-27 |
| 2 | HAL Generalist Agent / Claude-3.7 Sonnet (February 2025) | 56 | — | Verified | 2026-05-27 |
| 3 | TAU-bench Tool Calling / o3 Medium (April 2025) | 54 | — | Verified | 2026-05-27 |
| 4 | HAL Generalist Agent / Claude Opus 4.1 (August 2025) | 54 | — | Verified | 2026-05-27 |
| 5 | TAU-bench Tool Calling / Claude-3.7 Sonnet High (February 2025) | 52 | — | Verified | 2026-05-27 |
| 6 | TAU-bench Tool Calling / Claude Opus 4.1 High (August 2025) | 52 | — | Verified | 2026-05-27 |
| 7 | TAU-bench Tool Calling / Claude Opus 4.1 (August 2025) | 50 | — | Verified | 2026-05-27 |
| 8 | TAU-bench Tool Calling / GPT-5 Medium (August 2025) | 48 | — | Verified | 2026-05-27 |
| 9 | TAU-bench Tool Calling / DeepSeek V3 (March 2025) | 44 | — | Verified | 2026-05-27 |
| 10 | TAU-bench Tool Calling / Claude-3.7 Sonnet (February 2025) | 44 | — | Verified | 2026-05-27 |
| 11 | HAL Generalist Agent / Claude-3.7 Sonnet High (February 2025) | 44 | — | Verified | 2026-05-27 |
| 12 | HAL Generalist Agent / Claude Opus 4 (May 2025) | 44 | — | Verified | 2026-05-27 |
| 13 | HAL Generalist Agent / Claude Opus 4 High (May 2025) | 44 | — | Verified | 2026-05-27 |
| 14 | TAU-bench Tool Calling / o4-mini Low (April 2025) | 36 | — | Verified | 2026-05-27 |
| 15 | TAU-bench Tool Calling / GPT-4.1 (April 2025) | 36 | — | Verified | 2026-05-27 |
| 16 | TAU-bench Tool Calling / DeepSeek R1 (January 2025) | 36 | — | Verified | 2026-05-27 |
| 17 | HAL Generalist Agent / Claude Opus 4.1 High (August 2025) | 32 | — | Verified | 2026-05-27 |
| 18 | HAL Generalist Agent / GPT-5 Medium (August 2025) | 30 | — | Verified | 2026-05-27 |
| 19 | TAU-bench Tool Calling / Gemini 2.0 Flash High (February 2025) | 28 | — | Verified | 2026-05-27 |
| 20 | HAL Generalist Agent / Gemini 2.0 Flash (February 2025) | 22 | — | Verified | 2026-05-27 |
| 21 | HAL Generalist Agent / o4-mini Low (April 2025) | 22 | — | Verified | 2026-05-27 |
| 22 | HAL Generalist Agent / o3 Medium (April 2025) | 20 | — | Verified | 2026-05-27 |
| 23 | HAL Generalist Agent / DeepSeek V3 (March 2025) | 18 | — | Verified | 2026-05-27 |
| 24 | HAL Generalist Agent / o4-mini High (April 2025) | 18 | — | Verified | 2026-05-27 |
| 25 | HAL Generalist Agent / GPT-4.1 (April 2025) | 16 | — | Verified | 2026-05-27 |
| 26 | HAL Generalist Agent / DeepSeek R1 (January 2025) | 10 | — | Verified | 2026-05-27 |
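Each Subject cell combines an agent scaffold and a model, so the same model appears once per scaffold. The following is a minimal sketch, not part of HAL's tooling, of how the rows could be split and regrouped to compare scaffolds model by model. It assumes the " / " separator and the trailing parenthesized release date seen in the table; the names `parse_rows` and `accuracy_by_model` are hypothetical, and only four rows are copied in for brevity.

```python
import re
from collections import defaultdict

# Four rows copied verbatim from the table above; the rest parse the same way.
ROWS = """\
| 1 | TAU-bench Tool Calling / o4-mini High (April 2025) | 56 | — | Verified | 2026-05-27 |
| 15 | TAU-bench Tool Calling / GPT-4.1 (April 2025) | 36 | — | Verified | 2026-05-27 |
| 24 | HAL Generalist Agent / o4-mini High (April 2025) | 18 | — | Verified | 2026-05-27 |
| 25 | HAL Generalist Agent / GPT-4.1 (April 2025) | 16 | — | Verified | 2026-05-27 |
"""


def parse_rows(text):
    """Yield (agent, model, accuracy) for each markdown table row."""
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        subject, accuracy = cells[1], float(cells[2])
        # Assumption: the scaffold and model are separated by " / ".
        agent, model = subject.split(" / ", 1)
        # Drop the trailing release date, e.g. " (April 2025)".
        model = re.sub(r"\s*\([^)]*\)$", "", model)
        yield agent, model, accuracy


def accuracy_by_model(text):
    """Map each model name to a {agent scaffold: accuracy} dictionary."""
    table = defaultdict(dict)
    for agent, model, accuracy in parse_rows(text):
        table[model][agent] = accuracy
    return dict(table)


scores = accuracy_by_model(ROWS)
for model, by_agent in scores.items():
    print(model, by_agent)
```

Grouping by model rather than by rank makes the scaffold effect visible directly, since rank alone interleaves the two scaffolds.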