t2-bench

t2-bench is a benchmark for evaluating agentic tool-use capabilities: how well models select, sequence, and use tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

Rows: 22
Primary metric: Score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 0.99 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Self-reported | 2026-05-06 |
| 2 | Gemini 3 Flash | 0.90 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Self-reported | 2026-05-06 |
| 3 | GLM-5 | 0.90 | GLM GLM 5 (z-ai-glm-5) | Self-reported | 2026-05-06 |
| 4 | Qwen3.5-397B-A17B | 0.87 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Self-reported | 2026-05-06 |
| 5 | Gemma 4 31B | 0.86 | Gemma 4 31B (google-gemma-4-31b-it) | Self-reported | 2026-05-06 |
| 6 | Gemma 4 26B-A4B | 0.85 | Gemma 4 26B A4B (google-gemma-4-26b-a4b-it) | Self-reported | 2026-05-06 |
| 7 | Gemini 3 Pro | 0.85 | Gemini 3 (google-gemini-3) | Self-reported | 2026-05-06 |
| 8 | Qwen3.5-35B-A3B | 0.81 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Self-reported | 2026-05-06 |
| 9 | DeepSeek-V3.2-Speciale | 0.80 | DeepSeek V3.2 Speciale (deepseek-deepseek-v3.2-speciale) | Self-reported | 2026-05-06 |
| 9 | DeepSeek-V3.2 | 0.80 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Self-reported | 2026-05-06 |
| 11 | DeepSeek-V3.2 (Thinking) | 0.80 | R1 (deepseek-r1) | Self-reported | 2026-05-06 |
| 12 | Qwen3.5-4B | 0.80 | | Self-reported | 2026-05-06 |
| 13 | Qwen3.5-122B-A10B | 0.80 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Self-reported | 2026-05-06 |
| 14 | Qwen3.5-9B | 0.79 | Qwen3.5-9B (qwen-qwen3.5-9b) | Self-reported | 2026-05-06 |
| 15 | Qwen3.5-27B | 0.79 | Qwen3.5-27B (qwen-qwen3.5-27b) | Self-reported | 2026-05-06 |
| 16 | Qwen3 Max | 0.75 | Qwen3 Max (qwen-qwen3-max) | Self-reported | 2026-05-06 |
| 17 | K-EXAONE-236B-A23B | 0.73 | | Self-reported | 2026-05-06 |
| 18 | GPT OSS 120B High | 0.64 | | Self-reported | 2026-05-06 |
| 19 | Gemma 4 E4B | 0.57 | | Self-reported | 2026-05-06 |
| 20 | Qwen3.5-2B | 0.49 | | Self-reported | 2026-05-06 |
| 21 | Gemma 4 E2B | 0.29 | | Self-reported | 2026-05-06 |
| 22 | Qwen3.5-0.8B | 0.12 | | Self-reported | 2026-05-06 |