Tau2 Airline

The TAU2 airline domain benchmark evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools, in airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.

Rows: 20
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | LongCat-Flash-Thinking-2601 | 0.77 | | Self-reported | 2026-05-06 |
| 2 | LongCat-Flash-Thinking | 0.68 | | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 | 0.67 | GPT-5.1 (openai-gpt-5.1) | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 Instant | 0.67 | GPT-5.1 (openai-gpt-5.1) | Self-reported | 2026-05-06 |
| 3 | GPT-5.1 Thinking | 0.67 | GPT-5.1 (openai-gpt-5.1) | Self-reported | 2026-05-06 |
| 6 | o3 | 0.65 | o3 (openai-o3) | Self-reported | 2026-05-06 |
| 7 | Claude Haiku 4.5 | 0.64 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Self-reported | 2026-05-06 |
| 8 | GPT-5 | 0.63 | GPT-5 (openai-gpt-5) | Self-reported | 2026-05-06 |
| 9 | Qwen3-Next-80B-A3B-Thinking | 0.60 | Qwen3 Next 80B A3B Thinking (qwen-qwen3-next-80b-a3b-thinking) | Self-reported | 2026-05-06 |
| 10 | Qwen3-235B-A22B-Thinking-2507 | 0.58 | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Self-reported | 2026-05-06 |
| 10 | LongCat-Flash-Lite | 0.58 | | Self-reported | 2026-05-06 |
| 10 | LongCat-Flash-Chat | 0.58 | | Self-reported | 2026-05-06 |
| 13 | Kimi K2-Instruct-0905 | 0.56 | MoonshotAI: Kimi K2 0905 (moonshotai-kimi-k2-0905) | Self-reported | 2026-05-06 |
| 13 | Kimi K2 Instruct | 0.56 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Self-reported | 2026-05-06 |
| 15 | Nemotron 3 Super (120B A12B) | 0.56 | Nemotron 3 Super (nvidia-nemotron-3-super-120b-a12b) | Self-reported | 2026-05-06 |
| 16 | Mercury 2 | 0.53 | Mercury 2 (inception-mercury-2) | Self-reported | 2026-05-06 |
| 17 | Nemotron 3 Nano (30B A3B) | 0.48 | Nemotron 3 Nano 30B A3B (nvidia-nemotron-3-nano-30b-a3b) | Self-reported | 2026-05-06 |
| 18 | GPT-4o | 0.46 | GPT-4o (2024-08-06) (openai-gpt-4o-2024-08-06) | Self-reported | 2026-05-06 |
| 18 | Qwen3-Next-80B-A3B-Instruct | 0.46 | Qwen3 Next 80B A3B Instruct (qwen-qwen3-next-80b-a3b-instruct) | Self-reported | 2026-05-06 |
| 20 | Qwen3-235B-A22B-Instruct-2507 | 0.44 | Qwen3 235B A22B Instruct 2507 (qwen-qwen3-235b-a22b-2507) | Self-reported | 2026-05-06 |
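
The ranks listed above appear to follow standard competition ranking: subjects with the same score share a rank, and the next rank skips ahead by the size of the tie (3, 3, 3, then 6). The page does not document how ranks are computed, so the following is only a minimal sketch under that assumption. The function name `competition_ranks` is hypothetical, and the input is the six highest displayed scores. Rows that show the same rounded score but different ranks (for example 0.56 at ranks 13 and 15) suggest that ranks may be derived from unrounded scores, which this sketch does not have access to.

```python
def competition_ranks(scores):
    """Return 1-based standard competition ranks; a higher score ranks first.

    Tied scores share the rank of the first entry in the tie, and the
    following rank skips ahead by the number of tied entries.
    """
    # Indices of the scores, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        prev = order[position - 2] if position > 1 else None
        if prev is not None and scores[idx] == scores[prev]:
            ranks[idx] = ranks[prev]   # same score: reuse the tied rank
        else:
            ranks[idx] = position      # new score: rank equals position
    return ranks


# The six highest displayed scores from the results above.
top_scores = [0.77, 0.68, 0.67, 0.67, 0.67, 0.65]
print(competition_ranks(top_scores))  # [1, 2, 3, 3, 3, 6]
```

Comparing scores with exact equality is acceptable here only because the inputs are already rounded to two decimals; with raw scores a tolerance or an explicit rounding step would be a safer choice.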