τ-bench
τ-bench: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
Rows: 9
Primary metric: pass_1
Sampled: 2026-05-27
Metadata
Metrics: Pass^1, Pass^2, Pass^3, Pass^4
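Pass^k is usually described as the chance that all k independent trials of the same task succeed, averaged over tasks. The sketch below shows the commonly used combinatorial estimator, assuming each task was run `trials` times and succeeded `successes` times. The function names are illustrative and are not taken from the τ-bench codebase.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Uses C(successes, k) / C(trials, k): the fraction of size-k subsets
    of the observed trials that contain only successful runs.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks (hypothetical helper)."""
    if not per_task_successes:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(c, trials, k) for c in per_task_successes)
    return total / len(per_task_successes)
```

With k = 1 the estimate reduces to the plain success rate, which matches the Pass^1 column reported here. Larger k penalizes tasks that succeed only some of the time, so Pass^k cannot increase as k grows.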
| Rank | Subject | Pass^1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | TC (claude-3-5-sonnet-20241022) (retail) | 0.692 | — | Imported | 2026-05-27 |
| 2 | TC (claude-3-5-sonnet-20240620) (retail) | 0.626 | — | Imported | 2026-05-27 |
| 3 | TC (gpt-4o) (retail) | 0.604 | — | Imported | 2026-05-27 |
| 4 | TC (claude-3-5-sonnet-20241022) (airline) | 0.46 | — | Imported | 2026-05-27 |
| 5 | TC (gpt-4o) (airline) | 0.42 | — | Imported | 2026-05-27 |
| 6 | Act (gpt-4o) (airline) | 0.365 | — | Imported | 2026-05-27 |
| 7 | TC (claude-3-5-sonnet-20240620) (airline) | 0.36 | — | Imported | 2026-05-27 |
| 8 | ReAct (gpt-4o) (airline) | 0.325 | — | Imported | 2026-05-27 |
| 9 | TC (gpt-4o-mini) (airline) | 0.225 | — | Imported | 2026-05-27 |
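The ranking above mixes retail and airline results, which are not directly comparable. Each Subject appears to follow the pattern `strategy (model) (domain)`; that reading is an assumption based on the rows shown, as is treating TC, Act, and ReAct as agent strategies. A minimal sketch that splits the rows by domain under that assumption:

```python
import re
from collections import defaultdict

# (Subject, Pass^1) pairs copied from the table above.
ROWS = [
    ("TC (claude-3-5-sonnet-20241022) (retail)", 0.692),
    ("TC (claude-3-5-sonnet-20240620) (retail)", 0.626),
    ("TC (gpt-4o) (retail)", 0.604),
    ("TC (claude-3-5-sonnet-20241022) (airline)", 0.46),
    ("TC (gpt-4o) (airline)", 0.42),
    ("Act (gpt-4o) (airline)", 0.365),
    ("TC (claude-3-5-sonnet-20240620) (airline)", 0.36),
    ("ReAct (gpt-4o) (airline)", 0.325),
    ("TC (gpt-4o-mini) (airline)", 0.225),
]

# Assumed Subject layout: "<strategy> (<model>) (<domain>)".
SUBJECT = re.compile(r"^(?P<strategy>\S+) \((?P<model>[^)]+)\) \((?P<domain>[^)]+)\)$")


def by_domain(rows):
    """Group rows by domain, each group sorted by Pass^1 in descending order."""
    groups = defaultdict(list)
    for subject, score in rows:
        match = SUBJECT.match(subject)
        if match is None:
            raise ValueError(f"unexpected Subject format: {subject!r}")
        groups[match["domain"]].append((match["strategy"], match["model"], score))
    for entries in groups.values():
        entries.sort(key=lambda entry: entry[2], reverse=True)
    return dict(groups)


if __name__ == "__main__":
    for domain, entries in by_domain(ROWS).items():
        print(domain)
        for strategy, model, score in entries:
            print(f"  {strategy:<6}{model:<32}{score}")
```

Grouping this way makes it easier to compare models within one domain, or strategies for a single model such as gpt-4o on airline.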