TAU3-Bench

TAU3-Bench is a benchmark for evaluating general-purpose agent capabilities. It tests models on multi-turn interactions with simulated user models, on retrieval, and on complex decision-making scenarios.

Rows: 3
Primary metric: Score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject          Score  Model Match                             Provenance     Sampled
1     Qwen3.6 Plus     0.71   Qwen3.6 Plus (qwen-qwen3.6-plus)        Self-reported  2026-05-06
2     GLM-5.1          0.71   GLM GLM 5.1 (z-ai-glm-5.1)              Self-reported  2026-05-06
3     Qwen3.6-35B-A3B  0.67   Qwen3.6 35B A3B (qwen-qwen3.6-35b-a3b)  Self-reported  2026-05-06