Toolathlon
Tool Decathlon is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.
Rows: 25
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Score, Normalized Score
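Showing the 3 latest source slices. Ranks restart within each slice (grouped by sampled date), and scores appear as each source reported them, either as a percentage or as a fraction.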
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 59.9% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 | 59.3% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.6 | 56.8% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Self-reported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 41% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Self-reported | 2026-05-28 |
| 1 | GPT-5.5 | 0.56 | GPT-5.5 openai-gpt-5.5 | Self-reported | 2026-05-06 |
| 2 | GPT-5.4 | 0.55 | GPT-5.4 openai-gpt-5.4 | Self-reported | 2026-05-06 |
| 3 | DeepSeek-V4-Pro-Max | 0.52 | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Self-reported | 2026-05-06 |
| 4 | Kimi K2.6 | 0.50 | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-06 |
| 5 | Gemini 3 Flash | 0.49 | Gemini 3 Flash Preview google-gemini-3-flash-preview | Self-reported | 2026-05-06 |
| 6 | DeepSeek-V4-Flash-Max | 0.48 | DeepSeek V4 Flash deepseek-deepseek-v4-flash | Self-reported | 2026-05-06 |
| 7 | GPT-5.2 | 0.46 | GPT-5.2 openai-gpt-5.2 | Self-reported | 2026-05-06 |
| 7 | MiniMax M2.7 | 0.46 | MiniMax M2.7 minimax-minimax-m2.7 | Self-reported | 2026-05-06 |
| 9 | MiniMax M2.1 | 0.43 | MiniMax M2.1 minimax-minimax-m2.1 | Self-reported | 2026-05-06 |
| 10 | GPT-5.4 mini | 0.43 | GPT-5.4 Mini openai-gpt-5.4-mini | Self-reported | 2026-05-06 |
| 11 | GLM-5.1 | 0.41 | GLM 5.1 z-ai-glm-5.1 | Self-reported | 2026-05-06 |
| 12 | Qwen3.6 Plus | 0.40 | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-06 |
| 13 | Qwen3.5-397B-A17B | 0.38 | Qwen3.5 397B A17B qwen-qwen3.5-397b-a17b | Self-reported | 2026-05-06 |
| 14 | GPT-5.4 nano | 0.35 | GPT-5.4 Nano openai-gpt-5.4-nano | Self-reported | 2026-05-06 |
| 15 | DeepSeek-V3.2-Speciale | 0.35 | DeepSeek V3.2 Speciale deepseek-deepseek-v3.2-speciale | Self-reported | 2026-05-06 |
| 15 | DeepSeek-V3.2 | 0.35 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Self-reported | 2026-05-06 |
| 15 | DeepSeek-V3.2 (Thinking) | 0.35 | R1 deepseek-r1 | Self-reported | 2026-05-06 |
| 18 | Qwen3.6-35B-A3B | 0.27 | Qwen3.6 35B A3B qwen-qwen3.6-35b-a3b | Self-reported | 2026-05-06 |
| 1 | GPT-5.5 | 55.6% | GPT-5.5 openai-gpt-5.5 | Launch post | 2026-04-23 |
| 2 | GPT-5.4 | 54.6% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-23 |
| 3 | Gemini 3.1 Pro Preview | 48.8% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Launch post | 2026-04-23 |
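Because the slices report scores in two notations (for example `59.9%` and `0.56`), it can help to put them on one scale before reading them side by side. The following is a minimal sketch under stated assumptions: the helper name `to_fraction` is hypothetical, values ending in `%` are treated as percentages, and bare values are treated as fractions in the range 0 to 1. It is not the page's own "Normalized Score" definition, which is not given here and may differ.

```python
def to_fraction(score: str) -> float:
    """Convert a score cell such as '59.9%' or '0.56' to a fraction in [0, 1].

    Assumption: a trailing '%' marks a percentage; anything else is already a fraction.
    """
    text = score.strip()
    if text.endswith("%"):
        value = float(text[:-1]) / 100.0
    else:
        value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score out of range: {score!r}")
    return value


# A few rows copied from the table above: (subject, score cell, sampled date).
rows = [
    ("Claude Opus 4.8", "59.9%", "2026-05-28"),
    ("GPT-5.5", "0.56", "2026-05-06"),
    ("GPT-5.5", "55.6%", "2026-04-23"),
]

normalized = [(subject, to_fraction(score), sampled) for subject, score, sampled in rows]

for subject, value, sampled in normalized:
    print(f"{sampled}  {subject:<16} {value:.3f}")
```

Putting the values on one scale only makes them easier to read together. The slices come from different sources and sampling dates, so a common scale does not by itself make ranks comparable across slices.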