ITBench-AA

Artificial Analysis implementation of IBM's ITBench SRE benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots.

24 rows
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Average Precision at Full Recall, Average F1, Pass Rate, Average Turns (lower is better)

Latest Results

Rows are parsed from the public Artificial Analysis Next.js RSC defaultData payload and ranked by the configured primary metric.
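
The ranking step described above can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the function name `rank_rows`, the row dictionary layout, and the key `avg_precision_at_full_recall` are hypothetical, and parsing of the RSC payload itself is not shown. The three sample scores are taken from the table in this page.

```python
from typing import Any


def rank_rows(
    rows: list[dict[str, Any]],
    metric: str,
    higher_is_better: bool = True,
) -> list[dict[str, Any]]:
    """Return copies of rows sorted by `metric`, each with a 1-based `rank`.

    Hypothetical helper: rows missing the metric are dropped, and ties keep
    their input order because Python's sort is stable.
    """
    scored = [row for row in rows if row.get(metric) is not None]
    ordered = sorted(scored, key=lambda row: row[metric], reverse=higher_is_better)
    return [{"rank": position, **row} for position, row in enumerate(ordered, start=1)]


# Sample rows (scores from the table on this page; key names are assumed).
rows = [
    {"subject": "Qwen3.7 Max", "avg_precision_at_full_recall": 42.5},
    {"subject": "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", "avg_precision_at_full_recall": 46.7},
    {"subject": "GPT-5.5 (xhigh)", "avg_precision_at_full_recall": 45.8},
]

ranked = rank_rows(rows, metric="avg_precision_at_full_recall")
for row in ranked:
    print(row["rank"], row["subject"], f'{row["avg_precision_at_full_recall"]}%')
```

For a metric such as Average Turns, where lower is better, the same helper would be called with `higher_is_better=False`.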

Rank | Subject | Average Precision at Full Recall | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 46.7% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
2 | GPT-5.5 (xhigh) | 45.8% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
3 | Qwen3.7 Max | 42.5% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28
4 | Gemini 3.5 Flash (high) | 40.3% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
5 | GLM-5.1 (Reasoning) | 40.3% | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28
6 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 39.8% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
7 | DeepSeek V4 Pro (Reasoning, Max Effort) | 38.3% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28
8 | MiMo-V2.5-Pro | 38.2% | MiMo-V2.5-Pro (xiaomi-mimo-v2.5-pro) | Imported | 2026-05-28
9 | Gemma 4 31B (Reasoning) | 37.3% | Gemma 4 31B (google-gemma-4-31b-it) | Imported | 2026-05-28
10 | Qwen3.5 27B (Reasoning) | 35.5% | Qwen3.5-27B (qwen-qwen3.5-27b) | Imported | 2026-05-28
11 | GPT-5.4 mini (xhigh) | 35.2% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
12 | GPT-5.4 (xhigh) | 34.5% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
13 | Qwen3.5 397B A17B (Reasoning) | 34.1% | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Imported | 2026-05-28
14 | Grok 4.3 (high) | 32.7% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
15 | DeepSeek V4 Flash (Reasoning, Max Effort) | 31.5% | DeepSeek V4 Flash (deepseek-deepseek-v4-flash) | Imported | 2026-05-28
16 | Kimi K2.6 | 31.2% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
17 | Gemini 3.1 Pro Preview | 30.3% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
18 | Claude 4.5 Haiku (Reasoning) | 27.3% | | Imported | 2026-05-28
19 | MiniMax-M2.7 | 26.5% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28
20 | GPT-5.4 nano (xhigh) | 24.4% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
21 | Gemma 4 26B A4B (Reasoning) | 23.6% | Gemma 4 26B A4B (google-gemma-4-26b-a4b-it) | Imported | 2026-05-28
22 | Qwen3.5 35B A3B (Reasoning) | 21.5% | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Imported | 2026-05-28
23 | GPT-5.4 (Non-reasoning) | 18.9% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
24 | Llama 3.3 Instruct 70B | 0.6% | | Imported | 2026-05-28