APEX-Agents-AA
Artificial Analysis implementation of APEX-Agents using the Stirrup agent harness for long-horizon, cross-application professional-services tasks.
Rows: 18
Primary metric: score
Sampled: 2026-05-11
Metrics
Pass@1
| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 (xhigh) | 37.7% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-11 |
| 2 | GPT-5.4 (xhigh) | 33.3% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-11 |
| 3 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 33% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-11 |
| 4 | Gemini 3.1 Pro Preview | 32% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-11 |
| 5 | GPT-5.4 mini (xhigh) | 28.2% | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Imported | 2026-05-11 |
| 6 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 28% | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-11 |
| 7 | Gemini 3 Flash Preview (Reasoning) | 27.7% | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-11 |
| 8 | GPT-5.4 nano (xhigh) | 24.9% | GPT-5.4 Nano (`openai-gpt-5.4-nano`) | Imported | 2026-05-11 |
| 9 | Qwen3.5 397B A17B (Reasoning) | 15.3% | Qwen3.5 397B A17B (`qwen-qwen3.5-397b-a17b`) | Imported | 2026-05-11 |
| 10 | DeepSeek V3.2 (Reasoning) | 14.5% | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-11 |
| 11 | GLM-5 (Reasoning) | 14.5% | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-11 |
| 12 | Grok 4.20 0309 (Reasoning) | 14.2% | Grok 4.20 (`x-ai-grok-4.20`) | Imported | 2026-05-11 |
| 13 | Gemini 3.1 Flash-Lite Preview | 12.2% | Gemini 3.1 Flash Lite Preview (`google-gemini-3.1-flash-lite-preview`) | Imported | 2026-05-11 |
| 14 | Kimi K2.5 (Reasoning) | 11.5% | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-11 |
| 15 | MiniMax-M2.7 | 10.6% | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-11 |
| 16 | gpt-oss-120B (high) | 3.1% | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-11 |
| 17 | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 1.8% | Nemotron 3 Super (`nvidia-nemotron-3-super-120b-a12b`) | Imported | 2026-05-11 |
| 18 | gpt-oss-20B (high) | 0.7% | gpt-oss-20b (`openai-gpt-oss-20b`) | Imported | 2026-05-11 |
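
For readers unfamiliar with the metric, Pass@1 is conventionally the share of tasks solved on a single attempt. The sketch below illustrates that general definition only; how Artificial Analysis aggregates runs for this leaderboard is not stated here, and the function name `pass_at_1` and the sample outcomes are hypothetical.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Return the fraction of tasks whose single attempt passed.

    Assumption: one attempt per task, each graded pass/fail.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for 8 tasks (illustrative, not benchmark data).
example = [True, False, False, True, False, False, True, False]
print(f"{pass_at_1(example):.1%}")  # prints 37.5%
```

Under this reading, a score such as 37.7% would mean that a little over a third of the tasks were completed successfully on the first try.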