ITBench-AA
Artificial Analysis implementation of IBM's ITBench SRE benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots.
Rows: 24
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics
Average Precision at Full Recall, Average F1, Pass Rate, Average Turns (lower is better)
| Rank | Subject | Average Precision at Full Recall | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 46.7% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-28 |
| 2 | GPT-5.5 (xhigh) | 45.8% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-28 |
| 3 | Qwen3.7 Max | 42.5% | Qwen3.7 Max (`qwen-qwen3.7-max`) | Imported | 2026-05-28 |
| 4 | Gemini 3.5 Flash (high) | 40.3% | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-28 |
| 5 | GLM-5.1 (Reasoning) | 40.3% | GLM 5.1 (`z-ai-glm-5.1`) | Imported | 2026-05-28 |
| 6 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 39.8% | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-28 |
| 7 | DeepSeek V4 Pro (Reasoning, Max Effort) | 38.3% | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Imported | 2026-05-28 |
| 8 | MiMo-V2.5-Pro | 38.2% | MiMo-V2.5-Pro (`xiaomi-mimo-v2.5-pro`) | Imported | 2026-05-28 |
| 9 | Gemma 4 31B (Reasoning) | 37.3% | Gemma 4 31B (`google-gemma-4-31b-it`) | Imported | 2026-05-28 |
| 10 | Qwen3.5 27B (Reasoning) | 35.5% | Qwen3.5-27B (`qwen-qwen3.5-27b`) | Imported | 2026-05-28 |
| 11 | GPT-5.4 mini (xhigh) | 35.2% | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Imported | 2026-05-28 |
| 12 | GPT-5.4 (xhigh) | 34.5% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 13 | Qwen3.5 397B A17B (Reasoning) | 34.1% | Qwen3.5 397B A17B (`qwen-qwen3.5-397b-a17b`) | Imported | 2026-05-28 |
| 14 | Grok 4.3 (high) | 32.7% | Grok 4.3 (`x-ai-grok-4.3`) | Imported | 2026-05-28 |
| 15 | DeepSeek V4 Flash (Reasoning, Max Effort) | 31.5% | DeepSeek V4 Flash (`deepseek-deepseek-v4-flash`) | Imported | 2026-05-28 |
| 16 | Kimi K2.6 | 31.2% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-28 |
| 17 | Gemini 3.1 Pro Preview | 30.3% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-28 |
| 18 | Claude 4.5 Haiku (Reasoning) | 27.3% | — | Imported | 2026-05-28 |
| 19 | MiniMax-M2.7 | 26.5% | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-28 |
| 20 | GPT-5.4 nano (xhigh) | 24.4% | GPT-5.4 Nano (`openai-gpt-5.4-nano`) | Imported | 2026-05-28 |
| 21 | Gemma 4 26B A4B (Reasoning) | 23.6% | Gemma 4 26B A4B (`google-gemma-4-26b-a4b-it`) | Imported | 2026-05-28 |
| 22 | Qwen3.5 35B A3B (Reasoning) | 21.5% | Qwen3.5-35B-A3B (`qwen-qwen3.5-35b-a3b`) | Imported | 2026-05-28 |
| 23 | GPT-5.4 (Non-reasoning) | 18.9% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 24 | Llama 3.3 Instruct 70B | 0.6% | — | Imported | 2026-05-28 |