ITBench-AA | BenchmarkList

Metadata

ID: itbench_aa
Category: Agentic
Release: 2025-02-07
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

Average Precision at Full Recall, Average F1, Pass Rate, Average Turns (lower is better)

Rank	Subject	Average Precision at Full Recall	Model Match	Provenance	Sampled
1	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	46.7%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-28
2	GPT-5.5 (xhigh)	45.8%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-28
3	Qwen3.7 Max	42.5%	Qwen3.7 Max qwen-qwen3.7-max	Imported	2026-05-28
4	Gemini 3.5 Flash (high)	40.3%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-28
5	GLM-5.1 (Reasoning)	40.3%	GLM 5.1 z-ai-glm-5.1	Imported	2026-05-28
6	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	39.8%	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Imported	2026-05-28
7	DeepSeek V4 Pro (Reasoning, Max Effort)	38.3%	DeepSeek V4 Pro deepseek-deepseek-v4-pro	Imported	2026-05-28
8	MiMo-V2.5-Pro	38.2%	MiMo-V2.5-Pro xiaomi-mimo-v2.5-pro	Imported	2026-05-28
9	Gemma 4 31B (Reasoning)	37.3%	Gemma 4 31B google-gemma-4-31b-it	Imported	2026-05-28
10	Qwen3.5 27B (Reasoning)	35.5%	Qwen3.5-27B qwen-qwen3.5-27b	Imported	2026-05-28
11	GPT-5.4 mini (xhigh)	35.2%	GPT-5.4 Mini openai-gpt-5.4-mini	Imported	2026-05-28
12	GPT-5.4 (xhigh)	34.5%	GPT-5.4 openai-gpt-5.4	Imported	2026-05-28
13	Qwen3.5 397B A17B (Reasoning)	34.1%	Qwen3.5 397B A17B qwen-qwen3.5-397b-a17b	Imported	2026-05-28
14	Grok 4.3 (high)	32.7%	Grok 4.3 x-ai-grok-4.3	Imported	2026-05-28
15	DeepSeek V4 Flash (Reasoning, Max Effort)	31.5%	DeepSeek V4 Flash deepseek-deepseek-v4-flash	Imported	2026-05-28
16	Kimi K2.6	31.2%	MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-28
17	Gemini 3.1 Pro Preview	30.3%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-28
18	Claude 4.5 Haiku (Reasoning)	27.3%	—	Imported	2026-05-28
19	MiniMax-M2.7	26.5%	MiniMax M2.7 minimax-minimax-m2.7	Imported	2026-05-28
20	GPT-5.4 nano (xhigh)	24.4%	GPT-5.4 Nano openai-gpt-5.4-nano	Imported	2026-05-28
21	Gemma 4 26B A4B (Reasoning)	23.6%	Gemma 4 26B A4B google-gemma-4-26b-a4b-it	Imported	2026-05-28
22	Qwen3.5 35B A3B (Reasoning)	21.5%	Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b	Imported	2026-05-28
23	GPT-5.4 (Non-reasoning)	18.9%	GPT-5.4 openai-gpt-5.4	Imported	2026-05-28
24	Llama 3.3 Instruct 70B	0.6%	—	Imported	2026-05-28
