τ-bench

τ-bench evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.

Rows: 9
Primary metric: pass_1
Sampled: 2026-05-27

Metadata

Metrics

Pass^1, Pass^2, Pass^3, Pass^4
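
This page does not define Pass^k. The sketch below assumes the definition usually given for τ-bench: each task is run n times, c of those runs succeed, and Pass^k is the chance that k runs drawn without replacement all succeed, C(c, k) / C(n, k), averaged over tasks. The function names and the per-task success counts are hypothetical and only illustrate the arithmetic.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k of a task's trials, drawn without replacement, all succeed."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks (assumed aggregation)."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: three tasks, four trials each, with 4, 2, and 0 successes.
example_successes = [4, 2, 0]
pass_1 = benchmark_pass_hat_k(example_successes, num_trials=4, k=1)
pass_4 = benchmark_pass_hat_k(example_successes, num_trials=4, k=4)
```

Under this assumed definition Pass^1 is the plain average success rate, and Pass^k can only stay flat or fall as k grows, since it rewards tasks that succeed consistently across repeated trials.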

Latest Results

Rows are parsed from the public tau-bench README Markdown tables. The README warns that these tasks are outdated and recommends tau3-bench for the latest fixed tasks.

| Rank | Subject | Pass^1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | TC (claude-3-5-sonnet-20241022) (retail) | 0.692 | | Imported | 2026-05-27 |
| 2 | TC (claude-3-5-sonnet-20240620) (retail) | 0.626 | | Imported | 2026-05-27 |
| 3 | TC (gpt-4o) (retail) | 0.604 | | Imported | 2026-05-27 |
| 4 | TC (claude-3-5-sonnet-20241022) (airline) | 0.46 | | Imported | 2026-05-27 |
| 5 | TC (gpt-4o) (airline) | 0.42 | | Imported | 2026-05-27 |
| 6 | Act (gpt-4o) (airline) | 0.365 | | Imported | 2026-05-27 |
| 7 | TC (claude-3-5-sonnet-20240620) (airline) | 0.36 | | Imported | 2026-05-27 |
| 8 | ReAct (gpt-4o) (airline) | 0.325 | | Imported | 2026-05-27 |
| 9 | TC (gpt-4o-mini) (airline) | 0.225 | | Imported | 2026-05-27 |
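
Each Subject value above appears to pack three fields into one string: a strategy label, a model name in parentheses, and a domain in parentheses. Because retail and airline scores share one ranking, it can help to split the label before comparing rows. The following is a minimal sketch under that assumed layout; `Subject`, `parse_subject`, and the regular expression are hypothetical helpers, not part of tau-bench.

```python
import re
from typing import NamedTuple


class Subject(NamedTuple):
    strategy: str  # e.g. "TC", "Act", "ReAct"
    model: str     # e.g. "gpt-4o"
    domain: str    # e.g. "retail" or "airline"


# Assumed layout: "<strategy> (<model>) (<domain>)", as seen in the rows above.
_SUBJECT_RE = re.compile(
    r"^(?P<strategy>\S+) \((?P<model>[^()]+)\) \((?P<domain>[^()]+)\)$"
)


def parse_subject(label: str) -> Subject:
    """Split a Subject label into strategy, model, and domain."""
    match = _SUBJECT_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized subject label: {label!r}")
    return Subject(match["strategy"], match["model"], match["domain"])


parsed = parse_subject("TC (gpt-4o) (retail)")
# parsed == Subject(strategy="TC", model="gpt-4o", domain="retail")
```

With the fields separated, rows can be grouped by `domain` so that retail results are compared only against other retail results.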