EnterpriseOps-Gym
A stateful enterprise operations benchmark for evaluating LLM agents on long-horizon planning, tool use, and policy-governed workflows.
Rows: 24
Primary metric: task_success_rate
Sampled: 2026-05-05
Metadata
Metrics
Task Success Rate, Teams, CSM, Email, ITSM, Calendar, HR, Drive, Hybrid
| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 44.6% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-05 |
| 2 | Claude Sonnet 4.6 | 40.4% | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-05 |
| 3 | Claude Opus 4.5 | 37% | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-05 |
| 4 | Gemini 3.1 Pro | 36.6% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-05 |
| 5 | Gemini-3-Flash | 31.7% | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-05 |
| 6 | GPT-5.2 (High) | 31.3% | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-05 |
| 7 | Claude Sonnet 4.5 | 30.5% | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-05 |
| 8 | GPT-5 | 29.2% | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-05 |
| 9 | Gemini-3-Pro | 27.4% | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-05 |
| 10 | Nvidia Nemotron 3 Super (Think) | 27.3% | — | Imported | 2026-05-05 |
| 11 | Kimi-K2.5-Thinking | 26.2% | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-05 |
| 12 | DeepSeek-V3.2 (High) | 23.8% | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-05 |
| 13 | Minimax-m2.7 | 23% | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-05 |
| 14 | GPT-OSS-120B (High) | 23% | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-05 |
| 15 | GLM-5 | 22.2% | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-05 |
| 16 | DeepSeek-V3.2 (Medium) | 21.8% | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-05 |
| 17 | GPT-5.2 (Low) | 21.1% | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-05 |
| 18 | GPT-5-Mini | 20.6% | GPT-5 Mini (`openai-gpt-5-mini`) | Imported | 2026-05-05 |
| 19 | Kimi-K2-Thinking | 19.2% | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Imported | 2026-05-05 |
| 20 | Gemini-2.5-Pro | 17.8% | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-05 |
| 21 | Qwen3-30B (Think) | 16.3% | — | Imported | 2026-05-05 |
| 22 | Qwen3-235B (Inst.) | 15.8% | Qwen3 235B A22B (`qwen-qwen3-235b-a22b`) | Imported | 2026-05-05 |
| 23 | Gemma-26b-a4b | 15.1% | Gemma 4 26B A4B (`google-gemma-4-26b-a4b-it`) | Imported | 2026-05-05 |
| 24 | Qwen3-4B (Think) | 13.2% | — | Imported | 2026-05-05 |
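
Minimax-m2.7 and GPT-OSS-120B (High) both show a Task Success Rate of 23% yet hold ranks 13 and 14. Readers who prefer tied scores to share a rank could recompute ranks themselves. The sketch below is only an illustration on a few rows copied from the table: `competition_rank` is a hypothetical helper, not the method this leaderboard uses, and because the displayed percentages are rounded, a tie at the shown precision may not be a tie in the underlying data.

```python
# A few (subject, task success rate in %) pairs copied from the table.
rows = [
    ("Claude Opus 4.6", 44.6),
    ("Claude Sonnet 4.6", 40.4),
    ("DeepSeek-V3.2 (High)", 23.8),
    ("Minimax-m2.7", 23.0),
    ("GPT-OSS-120B (High)", 23.0),
    ("GLM-5", 22.2),
]


def competition_rank(rows):
    """Hypothetical helper: rank by descending score.

    Equal scores share a rank, and the rank after a tie skips ahead
    (1, 2, 2, 4 style). Ranks are relative to the rows passed in.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked = []
    for i, (name, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = i + 1
        ranked.append((rank, name, score))
    return ranked


for rank, name, score in competition_rank(rows):
    print(f"{rank:>2}  {name:<24} {score:.1f}%")
```

The ranks produced here are relative to the six-row subset, so they will not match the Rank column unless all 24 rows are supplied.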