TAU2-bench Retail

The retail variant of the TAU2-bench customer-service benchmark. It measures agents on multi-turn tool-use tasks in retail workflows.

Rows: 15
Primary metric: percent_successful
Sampled: 2026-05-06

Metadata

Metrics

Successful Sessions, Benchmark Score, Finished Successful, Avg. Agent Cost (lower is better), Avg. Steps (lower is better)

Latest Results

Rows are ranked by percent_successful. Agent and model display names are preserved from the source dataset.

Rank | Subject | Successful Sessions | Model Match | Provenance | Sampled
1 | OpenAI Solo / openai/aws/claude-opus-4-5 | 0.85 | | Imported | 2026-05-06
2 | Claude Code CLI / openai/aws/claude-opus-4-5 | 0.83 | | Imported | 2026-05-06
3 | LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview | 0.82 | | Imported | 2026-05-06
4 | LiteLLM Tool Calling with Shortlisting / openai/gcp/gemini-3-pro-preview | 0.82 | | Imported | 2026-05-06
5 | SmolAgents Code / openai/aws/claude-opus-4-5 | 0.78 | | Imported | 2026-05-06
6 | LiteLLM Tool Calling / openai/aws/claude-opus-4-5 | 0.78 | | Imported | 2026-05-06
7 | LiteLLM Tool Calling with Shortlisting / openai/aws/claude-opus-4-5 | 0.78 | | Imported | 2026-05-06
8 | SmolAgents Code / openai/gcp/gemini-3-pro-preview | 0.75 | | Imported | 2026-05-06
9 | OpenAI Solo / openai/gcp/gemini-3-pro-preview | 0.73 | | Imported | 2026-05-06
10 | LiteLLM Tool Calling / openai/Azure/gpt-5.2-2025-12-11 | 0.73 | | Imported | 2026-05-06
11 | LiteLLM Tool Calling with Shortlisting / openai/Azure/gpt-5.2-2025-12-11 | 0.73 | | Imported | 2026-05-06
12 | Claude Code CLI / openai/gcp/gemini-3-pro-preview | 0.71 | | Imported | 2026-05-06
13 | SmolAgents Code / openai/Azure/gpt-5.2-2025-12-11 | 0.68 | | Imported | 2026-05-06
14 | Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11 | 0.64 | | Imported | 2026-05-06
15 | OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11 | 0.53 | | Imported | 2026-05-06
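
The ranking rule stated above (rows ordered by percent_successful, highest first) can be sketched in a few lines of Python. This is an illustrative sketch, not the leaderboard's actual code. The `Row` type and the `split_subject` and `rank_rows` helpers are hypothetical names. Two further assumptions: each Subject has the form "agent / model", and tied rows keep their input order, since the source does not say how ties are broken. The four sample rows are copied from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row (hypothetical type, for illustration only)."""
    subject: str               # "agent / model", as shown in the Subject column
    percent_successful: float  # the Successful Sessions value


def split_subject(subject: str) -> tuple[str, str]:
    """Split 'agent / model' on the first ' / ' (assumed separator)."""
    agent, model = subject.split(" / ", 1)
    return agent, model


def rank_rows(rows: list[Row]) -> list[tuple[int, Row]]:
    """Order rows by percent_successful, highest first, and number them from 1.

    sorted() is stable, so tied rows keep their input order. That tie rule
    is an assumption; the source does not document one.
    """
    ordered = sorted(rows, key=lambda r: r.percent_successful, reverse=True)
    return list(enumerate(ordered, start=1))


# Four rows copied from the table above, deliberately out of order.
ROWS = [
    Row("OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11", 0.53),
    Row("Claude Code CLI / openai/aws/claude-opus-4-5", 0.83),
    Row("OpenAI Solo / openai/aws/claude-opus-4-5", 0.85),
    Row("LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview", 0.82),
]

ranked = rank_rows(ROWS)
for position, row in ranked:
    agent, model = split_subject(row.subject)
    print(f"{position} | {agent} | {model} | {row.percent_successful:.2f}")
```

Because the model identifiers themselves contain slashes without surrounding spaces, splitting only on the first " / " keeps a name such as openai/aws/claude-opus-4-5 intact.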