ToolSandbox

ToolSandbox: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.

Rows: 13
Primary metric: average_similarity
Sampled: 2026-05-27

Metadata

Metrics

Avg Score, Single Tool Call, Multiple Tool Call, Single User Turn, Multiple User Turn, State Dependency, Canonicalization, Insufficient Information, 0 Distraction Tools, 3 Distraction Tools, 10 Distraction Tools, All Tools, Tool Name Scrambled, Tool Description Scrambled, Argument Description Scrambled, Argument Type Scrambled

Latest Results

Rows are transcribed from Table 5 of the public ToolSandbox paper. The primary score is the average similarity across all scenarios.
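As a rough illustration of how an average-similarity score of this kind can be computed, the sketch below takes the mean of per-scenario similarity values and scales it to a 0-100 range. It assumes each scenario yields one similarity in [0, 1] and that the table reports the mean as a percentage. The function name `average_similarity` and the sample values are hypothetical and do not come from the ToolSandbox paper or its code.

```python
from statistics import mean


def average_similarity(scenario_scores):
    """Return the mean of per-scenario similarity scores, scaled to 0-100.

    Assumption: each score is a similarity in [0, 1], one per scenario.
    """
    if not scenario_scores:
        raise ValueError("need at least one scenario score")
    return 100 * mean(scenario_scores)


# Made-up per-scenario similarities, for illustration only.
example_scores = [1.0, 0.5, 0.75, 0.25]
print(average_similarity(example_scores))
```

Each per-category column under Metrics could be produced the same way by restricting the input to the scenarios in that category.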

Rank | Subject | Avg Score | Model Match | Provenance | Sampled
1 | GPT-4o-2024-05-13 | 73 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
2 | Claude-3-Opus-20240229 | 69.2 | | Imported | 2026-05-27
3 | GPT-3.5-Turbo-0125 | 65.6 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27
4 | GPT-4-0125-Preview | 64.3 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27
5 | Claude-3-Sonnet-20240229 | 63.8 | | Imported | 2026-05-27
6 | Gemini-1.5-Pro-001 | 60.4 | | Imported | 2026-05-27
7 | Claude-3-Haiku-20240307 | 54.9 | Claude 3 Haiku (anthropic-claude-3-haiku) | Imported | 2026-05-27
8 | Gemini-1.0-Pro | 38.1 | | Imported | 2026-05-27
9 | Hermes-2-Pro-Mistral-7B | 31.4 | | Imported | 2026-05-27
10 | Mistral-7B-Instruct-v0.3 | 29.8 | | Imported | 2026-05-27
11 | C4AI-Command-R-v01 | 26.2 | | Imported | 2026-05-27
12 | Gorilla-Openfunctions-v2 | 25.6 | | Imported | 2026-05-27
13 | C4AI-Command R+ | 24.7 | | Imported | 2026-05-27