ToolSandbox
ToolSandbox evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
Rows: 13
Primary metric: average_similarity
Sampled: 2026-05-27
Metrics
Avg Score, Single Tool Call, Multiple Tool Call, Single User Turn, Multiple User Turn, State Dependency, Canonicalization, Insufficient Information, 0 Distraction Tools, 3 Distraction Tools, 10 Distraction Tools, All Tools, Tool Name Scrambled, Tool Description Scrambled, Argument Description Scrambled, Argument Type Scrambled
| Rank | Subject | Avg Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o-2024-05-13 | 73 | GPT-4o openai-gpt-4o | Imported | 2026-05-27 |
| 2 | Claude-3-Opus-20240229 | 69.2 | — | Imported | 2026-05-27 |
| 3 | GPT-3.5-Turbo-0125 | 65.6 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 4 | GPT-4-0125-Preview | 64.3 | GPT-4 openai-gpt-4 | Imported | 2026-05-27 |
| 5 | Claude-3-Sonnet-20240229 | 63.8 | — | Imported | 2026-05-27 |
| 6 | Gemini-1.5-Pro-001 | 60.4 | — | Imported | 2026-05-27 |
| 7 | Claude-3-Haiku-20240307 | 54.9 | Claude 3 Haiku anthropic-claude-3-haiku | Imported | 2026-05-27 |
| 8 | Gemini-1.0-Pro | 38.1 | — | Imported | 2026-05-27 |
| 9 | Hermes-2-Pro-Mistral-7B | 31.4 | — | Imported | 2026-05-27 |
| 10 | Mistral-7B-Instruct-v0.3 | 29.8 | — | Imported | 2026-05-27 |
| 11 | C4AI-Command-R-v01 | 26.2 | — | Imported | 2026-05-27 |
| 12 | Gorilla-Openfunctions-v2 | 25.6 | — | Imported | 2026-05-27 |
| 13 | C4AI-Command R+ | 24.7 | — | Imported | 2026-05-27 |