Tau2 Airline
The TAU2 airline domain benchmark evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools, in airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.
Rows: 20
Primary metric: Score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | LongCat-Flash-Thinking-2601 | 0.77 | — | Self-reported | 2026-05-06 |
| 2 | LongCat-Flash-Thinking | 0.68 | — | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 | 0.67 | GPT-5.1 (`openai-gpt-5.1`) | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 Instant | 0.67 | GPT-5.1 (`openai-gpt-5.1`) | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 Thinking | 0.67 | GPT-5.1 (`openai-gpt-5.1`) | Self-reported | 2026-05-06 |
| 6 | o3 | 0.65 | o3 (`openai-o3`) | Self-reported | 2026-05-06 |
| 7 | Claude Haiku 4.5 | 0.64 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Self-reported | 2026-05-06 |
| 8 | GPT-5 | 0.63 | GPT-5 (`openai-gpt-5`) | Self-reported | 2026-05-06 |
| 9 | Qwen3-Next-80B-A3B-Thinking | 0.60 | Qwen3 Next 80B A3B Thinking (`qwen-qwen3-next-80b-a3b-thinking`) | Self-reported | 2026-05-06 |
| 10 | Qwen3-235B-A22B-Thinking-2507 | 0.58 | Qwen3 235B A22B Thinking 2507 (`qwen-qwen3-235b-a22b-thinking-2507`) | Self-reported | 2026-05-06 |
| 10 | LongCat-Flash-Lite | 0.58 | — | Self-reported | 2026-05-06 |
| 10 | LongCat-Flash-Chat | 0.58 | — | Self-reported | 2026-05-06 |
| 13 | Kimi K2-Instruct-0905 | 0.56 | MoonshotAI: Kimi K2 0905 (`moonshotai-kimi-k2-0905`) | Self-reported | 2026-05-06 |
| 13 | Kimi K2 Instruct | 0.56 | MoonshotAI: Kimi K2 0711 (`moonshotai-kimi-k2`) | Self-reported | 2026-05-06 |
| 15 | Nemotron 3 Super (120B A12B) | 0.56 | Nemotron 3 Super (`nvidia-nemotron-3-super-120b-a12b`) | Self-reported | 2026-05-06 |
| 16 | Mercury 2 | 0.53 | Mercury 2 (`inception-mercury-2`) | Self-reported | 2026-05-06 |
| 17 | Nemotron 3 Nano (30B A3B) | 0.48 | Nemotron 3 Nano 30B A3B (`nvidia-nemotron-3-nano-30b-a3b`) | Self-reported | 2026-05-06 |
| 18 | GPT-4o | 0.46 | GPT-4o (2024-08-06) (`openai-gpt-4o-2024-08-06`) | Self-reported | 2026-05-06 |
| 18 | Qwen3-Next-80B-A3B-Instruct | 0.46 | Qwen3 Next 80B A3B Instruct (`qwen-qwen3-next-80b-a3b-instruct`) | Self-reported | 2026-05-06 |
| 20 | Qwen3-235B-A22B-Instruct-2507 | 0.44 | Qwen3 235B A22B Instruct 2507 (`qwen-qwen3-235b-a22b-2507`) | Self-reported | 2026-05-06 |
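
The Rank column repeats a rank for tied scores and then skips ahead (1, 2, 3, 3, 3, 6), which is consistent with standard competition ranking. The page does not state how ranks are computed, so the sketch below is only an assumption about the scheme, and the function name `competition_rank` is hypothetical. It is applied to the six highest displayed scores; ranks on the page may be derived from unrounded values, so displayed ties need not always share a rank.

```python
def competition_rank(scores):
    """Return 1-based ranks in which tied scores share the best rank.

    Assumed scheme (standard competition ranking): a score's rank is one
    plus the number of scores strictly greater than it.
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (best) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Six highest displayed scores from the table above.
top_scores = [0.77, 0.68, 0.67, 0.67, 0.67, 0.65]
print(competition_rank(top_scores))  # [1, 2, 3, 3, 3, 6]
```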