TAU2-bench Retail
The retail variant of the TAU2-bench customer-service benchmark, which measures agents on multi-turn tool-use tasks.
Rows: 15
Primary metric: percent_successful
Sampled: 2026-05-06
Metadata
Metrics
Successful Sessions, Benchmark Score, Finished Successful, Avg. Agent Cost (lower is better), Avg. Steps (lower is better)
| Rank | Subject | Successful Sessions | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | OpenAI Solo / openai/aws/claude-opus-4-5 | 0.85 | — | Imported | 2026-05-06 |
| 2 | Claude Code CLI / openai/aws/claude-opus-4-5 | 0.83 | — | Imported | 2026-05-06 |
| 3 | LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview | 0.82 | — | Imported | 2026-05-06 |
| 4 | LiteLLM Tool Calling with Shortlisting / openai/gcp/gemini-3-pro-preview | 0.82 | — | Imported | 2026-05-06 |
| 5 | SmolAgents Code / openai/aws/claude-opus-4-5 | 0.78 | — | Imported | 2026-05-06 |
| 6 | LiteLLM Tool Calling / openai/aws/claude-opus-4-5 | 0.78 | — | Imported | 2026-05-06 |
| 7 | LiteLLM Tool Calling with Shortlisting / openai/aws/claude-opus-4-5 | 0.78 | — | Imported | 2026-05-06 |
| 8 | SmolAgents Code / openai/gcp/gemini-3-pro-preview | 0.75 | — | Imported | 2026-05-06 |
| 9 | OpenAI Solo / openai/gcp/gemini-3-pro-preview | 0.73 | — | Imported | 2026-05-06 |
| 10 | LiteLLM Tool Calling / openai/Azure/gpt-5.2-2025-12-11 | 0.73 | — | Imported | 2026-05-06 |
| 11 | LiteLLM Tool Calling with Shortlisting / openai/Azure/gpt-5.2-2025-12-11 | 0.73 | — | Imported | 2026-05-06 |
| 12 | Claude Code CLI / openai/gcp/gemini-3-pro-preview | 0.71 | — | Imported | 2026-05-06 |
| 13 | SmolAgents Code / openai/Azure/gpt-5.2-2025-12-11 | 0.68 | — | Imported | 2026-05-06 |
| 14 | Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11 | 0.64 | — | Imported | 2026-05-06 |
| 15 | OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11 | 0.53 | — | Imported | 2026-05-06 |
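
Because each model appears under several agent setups, a per-group average can make the table easier to compare. The sketch below is illustrative only: it copies the Successful Sessions values from the table, and it assumes that the text before ` / ` in the Subject column names the agent scaffold and the text after it names the model route. The helper `mean_by` and the names `by_model` and `by_scaffold` are hypothetical, not part of the benchmark tooling.

```python
from collections import defaultdict
from statistics import mean

# (Subject, Successful Sessions) pairs copied from the table above.
ROWS = [
    ("OpenAI Solo / openai/aws/claude-opus-4-5", 0.85),
    ("Claude Code CLI / openai/aws/claude-opus-4-5", 0.83),
    ("LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview", 0.82),
    ("LiteLLM Tool Calling with Shortlisting / openai/gcp/gemini-3-pro-preview", 0.82),
    ("SmolAgents Code / openai/aws/claude-opus-4-5", 0.78),
    ("LiteLLM Tool Calling / openai/aws/claude-opus-4-5", 0.78),
    ("LiteLLM Tool Calling with Shortlisting / openai/aws/claude-opus-4-5", 0.78),
    ("SmolAgents Code / openai/gcp/gemini-3-pro-preview", 0.75),
    ("OpenAI Solo / openai/gcp/gemini-3-pro-preview", 0.73),
    ("LiteLLM Tool Calling / openai/Azure/gpt-5.2-2025-12-11", 0.73),
    ("LiteLLM Tool Calling with Shortlisting / openai/Azure/gpt-5.2-2025-12-11", 0.73),
    ("Claude Code CLI / openai/gcp/gemini-3-pro-preview", 0.71),
    ("SmolAgents Code / openai/Azure/gpt-5.2-2025-12-11", 0.68),
    ("Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11", 0.64),
    ("OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11", 0.53),
]


def mean_by(rows, part):
    """Average the score grouped by one half of the Subject string.

    part=0 groups by the text before ' / ' (assumed: agent scaffold);
    part=1 groups by the text after it (assumed: model route).
    """
    groups = defaultdict(list)
    for subject, score in rows:
        key = subject.split(" / ", 1)[part]
        groups[key].append(score)
    return {key: round(mean(scores), 3) for key, scores in groups.items()}


by_model = mean_by(ROWS, 1)
by_scaffold = mean_by(ROWS, 0)

# Print model routes from highest to lowest average score.
for name, avg in sorted(by_model.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {avg:.3f}")
```

These are unweighted means over five rows per model and three rows per scaffold, so they summarize only this table and say nothing about per-task variance or the other listed metrics.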