Toolathlon

Toolathlon (Tool Decathlon) is a benchmark for evaluating how well AI agents use multiple tools across diverse task categories. It measures tool selection, sequencing, and execution across ten tool-use scenarios.

Rows: 25
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Normalized Score

Showing 3 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.8 | 59.9% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28
2 | Claude Opus 4.7 | 59.3% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
3 | Claude Opus 4.6 | 56.8% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-28
4 | Claude Sonnet 4.6 | 41% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Self-reported | 2026-05-28

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | GPT-5.5 | 0.56 | GPT-5.5 (openai-gpt-5.5) | Self-reported | 2026-05-06
2 | GPT-5.4 | 0.55 | GPT-5.4 (openai-gpt-5.4) | Self-reported | 2026-05-06
3 | DeepSeek-V4-Pro-Max | 0.52 | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Self-reported | 2026-05-06
4 | Kimi K2.6 | 0.50 | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Self-reported | 2026-05-06
5 | Gemini 3 Flash | 0.49 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Self-reported | 2026-05-06
6 | DeepSeek-V4-Flash-Max | 0.48 | DeepSeek V4 Flash (deepseek-deepseek-v4-flash) | Self-reported | 2026-05-06
7 | GPT-5.2 | 0.46 | GPT-5.2 (openai-gpt-5.2) | Self-reported | 2026-05-06
7 | MiniMax M2.7 | 0.46 | MiniMax M2.7 (minimax-minimax-m2.7) | Self-reported | 2026-05-06
9 | MiniMax M2.1 | 0.43 | MiniMax M2.1 (minimax-minimax-m2.1) | Self-reported | 2026-05-06
10 | GPT-5.4 mini | 0.43 | GPT-5.4 Mini (openai-gpt-5.4-mini) | Self-reported | 2026-05-06
11 | GLM-5.1 | 0.41 | GLM GLM 5.1 (z-ai-glm-5.1) | Self-reported | 2026-05-06
12 | Qwen3.6 Plus | 0.40 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-06
13 | Qwen3.5-397B-A17B | 0.38 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Self-reported | 2026-05-06
14 | GPT-5.4 nano | 0.35 | GPT-5.4 Nano (openai-gpt-5.4-nano) | Self-reported | 2026-05-06
15 | DeepSeek-V3.2-Speciale | 0.35 | DeepSeek V3.2 Speciale (deepseek-deepseek-v3.2-speciale) | Self-reported | 2026-05-06
15 | DeepSeek-V3.2 | 0.35 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Self-reported | 2026-05-06
15 | DeepSeek-V3.2 (Thinking) | 0.35 | R1 (deepseek-r1) | Self-reported | 2026-05-06
18 | Qwen3.6-35B-A3B | 0.27 | Qwen3.6 35B A3B (qwen-qwen3.6-35b-a3b) | Self-reported | 2026-05-06

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | GPT-5.5 | 55.6% | GPT-5.5 (openai-gpt-5.5) | Launch post | 2026-04-23
2 | GPT-5.4 | 54.6% | GPT-5.4 (openai-gpt-5.4) | Launch post | 2026-04-23
3 | Gemini 3.1 Pro Preview | 48.8% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Launch post | 2026-04-23
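
The source slices above report scores in two notations: percentages (59.9%) and bare decimals (0.56). To compare rows across slices, the values need a common scale. The sketch below is one way to do that, assuming that a trailing "%" marks a percentage and that bare decimals are already fractions of 1. The page does not state this, although GPT-5.5 appearing as both 0.56 and 55.6% is consistent with it. The function name `parse_score` is hypothetical and not part of any leaderboard tooling.

```python
def parse_score(raw: str) -> float:
    """Convert a leaderboard score string to a fraction of 1.

    Assumption (not confirmed by the page): values ending in '%' are
    percentages, and bare values are already fractions.
    """
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


# A few rows copied from the slices above, in their original notation.
rows = [
    ("Claude Opus 4.8", "59.9%"),
    ("GPT-5.5 (self-reported slice)", "0.56"),
    ("GPT-5.5 (launch post slice)", "55.6%"),
]

for name, raw in rows:
    print(f"{name}: {parse_score(raw):.3f}")
```

Even on a common scale, the rows remain claims from different sources sampled on different dates, so small differences between slices should be read with caution.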