StableToolBench
StableToolBench evaluates LLM tool-use systems on solvable tool-query tasks, reporting pass-rate and win-rate scores across instruction, category, and tool subsets.
10 rows
Primary metric: pass_rate_average
Sampled: 2026-05-06
Metrics
Pass Rate Average, Pass Rate Average SE (lower is better), Win Rate Average
| Rank | Subject | Pass Rate Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4-Turbo-Preview (DFS) | 73.20 | — | Imported | 2026-05-06 |
| 2 | GPT-3.5-Turbo-1106 (DFS) | 69.90 | — | Imported | 2026-05-06 |
| 3 | GPT-4-0613 (DFS) | 69.70 | — | Imported | 2026-05-06 |
| 4 | GPT-3.5-Turbo-0613 (DFS) | 68.10 | — | Imported | 2026-05-06 |
| 5 | GPT-4-Turbo-Preview (CoT) | 60.80 | — | Imported | 2026-05-06 |
| 6 | ToolLLaMA v2 (DFS) | 58.70 | — | Imported | 2026-05-06 |
| 7 | GPT-4-0613 (CoT) | 55.40 | — | Imported | 2026-05-06 |
| 8 | GPT-3.5-Turbo-1106 (CoT) | 52.10 | — | Imported | 2026-05-06 |
| 9 | GPT-3.5-Turbo-0613 (CoT) | 49.10 | — | Imported | 2026-05-06 |
| 10 | ToolLLaMA v2 (CoT) | 38.90 | — | Imported | 2026-05-06 |
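Each model appears twice in the table, once with the DFS label and once with the CoT label, so the two can be compared per model. The snippet below is a minimal sketch of that comparison using only the Pass Rate Average values listed above; the `rows` list and the `strategy_gaps` helper are illustrative names, not part of StableToolBench.

```python
# Pass Rate Average values copied from the table above:
# (model, strategy label, score).
rows = [
    ("GPT-4-Turbo-Preview", "DFS", 73.20),
    ("GPT-3.5-Turbo-1106", "DFS", 69.90),
    ("GPT-4-0613", "DFS", 69.70),
    ("GPT-3.5-Turbo-0613", "DFS", 68.10),
    ("GPT-4-Turbo-Preview", "CoT", 60.80),
    ("ToolLLaMA v2", "DFS", 58.70),
    ("GPT-4-0613", "CoT", 55.40),
    ("GPT-3.5-Turbo-1106", "CoT", 52.10),
    ("GPT-3.5-Turbo-0613", "CoT", 49.10),
    ("ToolLLaMA v2", "CoT", 38.90),
]


def strategy_gaps(rows):
    """Return {model: DFS score minus CoT score} for models that have both."""
    by_model = {}
    for model, strategy, score in rows:
        by_model.setdefault(model, {})[strategy] = score
    return {
        model: round(scores["DFS"] - scores["CoT"], 2)
        for model, scores in by_model.items()
        if "DFS" in scores and "CoT" in scores
    }


gaps = strategy_gaps(rows)
# Print models from largest to smallest DFS-over-CoT gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:+.2f}")
```

On these ten rows, every model scores higher with the DFS label than with the CoT label. The gap is a plain difference of reported averages and does not account for the standard errors listed under Metrics.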