StableToolBench

StableToolBench evaluates LLM tool-use systems on solvable tool-query tasks, reporting pass-rate and win-rate scores across instruction, category, and tool subsets.

Rows: 10
Primary metric: pass_rate_average
Sampled: 2026-05-06

Metadata

Metrics

Pass Rate Average, Pass Rate Average SE (lower is better), Win Rate Average
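
This page does not state how Pass Rate Average or its SE is derived. As a rough sketch only, assume the average is the mean of per-subset pass rates and the SE is the standard error of that mean (sample standard deviation divided by the square root of the number of subsets). The function name `mean_and_standard_error` and the input values are hypothetical placeholders, not benchmark data.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of scores.

    Assumption: SE = sample standard deviation / sqrt(n). The benchmark
    may compute its SE differently.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


if __name__ == "__main__":
    # Placeholder per-subset pass rates, invented for illustration only.
    placeholder_scores = [50.0, 60.0, 70.0]
    mean, se = mean_and_standard_error(placeholder_scores)
    print(f"average = {mean:.2f}, SE = {se:.2f}")
```

Under this assumption a smaller SE means the subset scores agree more closely, which is consistent with the "lower is better" note.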

Latest Results

| Rank | Subject | Pass Rate Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4-Turbo-Preview (DFS) | 73.20 | | Imported | 2026-05-06 |
| 2 | GPT-3.5-Turbo-1106 (DFS) | 69.90 | | Imported | 2026-05-06 |
| 3 | GPT-4-0613 (DFS) | 69.70 | | Imported | 2026-05-06 |
| 4 | GPT-3.5-Turbo-0613 (DFS) | 68.10 | | Imported | 2026-05-06 |
| 5 | GPT-4-Turbo-Preview (CoT) | 60.80 | | Imported | 2026-05-06 |
| 6 | ToolLLaMA v2 (DFS) | 58.70 | | Imported | 2026-05-06 |
| 7 | GPT-4-0613 (CoT) | 55.40 | | Imported | 2026-05-06 |
| 8 | GPT-3.5-Turbo-1106 (CoT) | 52.10 | | Imported | 2026-05-06 |
| 9 | GPT-3.5-Turbo-0613 (CoT) | 49.10 | | Imported | 2026-05-06 |
| 10 | ToolLLaMA v2 (CoT) | 38.90 | | Imported | 2026-05-06 |
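
Each model in the results appears twice, once with the DFS label and once with the CoT label. A minimal sketch of comparing the two per model follows. The scores are copied from the results above, while the helper names `split_subject` and `method_gaps` are hypothetical and not part of any StableToolBench tooling.

```python
# Rows copied from the Latest Results listing: (subject, pass rate average).
ROWS = [
    ("GPT-4-Turbo-Preview (DFS)", 73.20),
    ("GPT-3.5-Turbo-1106 (DFS)", 69.90),
    ("GPT-4-0613 (DFS)", 69.70),
    ("GPT-3.5-Turbo-0613 (DFS)", 68.10),
    ("GPT-4-Turbo-Preview (CoT)", 60.80),
    ("ToolLLaMA v2 (DFS)", 58.70),
    ("GPT-4-0613 (CoT)", 55.40),
    ("GPT-3.5-Turbo-1106 (CoT)", 52.10),
    ("GPT-3.5-Turbo-0613 (CoT)", 49.10),
    ("ToolLLaMA v2 (CoT)", 38.90),
]


def split_subject(subject):
    """Split 'Model (Method)' into ('Model', 'Method')."""
    model, _, method = subject.rpartition(" (")
    return model, method.rstrip(")")


def method_gaps(rows):
    """Return {model: DFS score minus CoT score}, rounded to 2 decimals."""
    scores = {}
    for subject, score in rows:
        model, method = split_subject(subject)
        scores.setdefault(model, {})[method] = score
    return {
        model: round(by_method["DFS"] - by_method["CoT"], 2)
        for model, by_method in scores.items()
        if "DFS" in by_method and "CoT" in by_method
    }


if __name__ == "__main__":
    # Print models from largest to smallest DFS-over-CoT gap.
    for model, gap in sorted(method_gaps(ROWS).items(), key=lambda kv: -kv[1]):
        print(f"{model}: DFS - CoT = {gap:.2f}")
```

In these rows every model scores higher with DFS than with CoT, with gaps ranging from 12.40 points (GPT-4-Turbo-Preview) to 19.80 points (ToolLLaMA v2).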