EnterpriseOps-Gym

A stateful enterprise operations benchmark for evaluating LLM agents on long-horizon planning, tool use, and policy-governed workflows.

Rows: 24
Primary metric: task_success_rate
Sampled: 2026-05-05

Metadata

Metrics

Task Success Rate, Teams, CSM, Email, ITSM, Calendar, HR, Drive, Hybrid

Latest Results

Oracle Mode task success rate values are percentages. Rows are ranked by the public average task success rate across EnterpriseOps domains.

| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 44.6% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-05 |
| 2 | Claude Sonnet 4.6 | 40.4% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-05 |
| 3 | Claude Opus 4.5 | 37% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-05 |
| 4 | Gemini 3.1 Pro | 36.6% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-05 |
| 5 | Gemini-3-Flash | 31.7% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-05 |
| 6 | GPT-5.2 (High) | 31.3% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-05 |
| 7 | Claude Sonnet 4.5 | 30.5% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-05 |
| 8 | GPT-5 | 29.2% | GPT-5 (openai-gpt-5) | Imported | 2026-05-05 |
| 9 | Gemini-3-Pro | 27.4% | Gemini 3 (google-gemini-3) | Imported | 2026-05-05 |
| 10 | Nvidia Nemotron 3 Super (Think) | 27.3% | | Imported | 2026-05-05 |
| 11 | Kimi-K2.5-Thinking | 26.2% | KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-05 |
| 12 | DeepSeek-V3.2 (High) | 23.8% | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-05 |
| 13 | Minimax-m2.7 | 23% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-05 |
| 14 | GPT-OSS-120B (High) | 23% | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-05 |
| 15 | GLM-5 | 22.2% | GLM GLM 5 (z-ai-glm-5) | Imported | 2026-05-05 |
| 16 | DeepSeek-V3.2 (Medium) | 21.8% | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-05 |
| 17 | GPT-5.2 (Low) | 21.1% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-05 |
| 18 | GPT-5-Mini | 20.6% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-05 |
| 19 | Kimi-K2-Thinking | 19.2% | KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-05 |
| 20 | Gemini-2.5-Pro | 17.8% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-05 |
| 21 | Qwen3-30B (Think) | 16.3% | | Imported | 2026-05-05 |
| 22 | Qwen3-235B (Inst.) | 15.8% | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-05 |
| 23 | Gemma-26b-a4b | 15.1% | Gemma 4 26B A4B (google-gemma-4-26b-a4b-it) | Imported | 2026-05-05 |
| 24 | Qwen3-4B (Think) | 13.2% | | Imported | 2026-05-05 |
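
The ranking rule stated under Latest Results (rows ordered by average task success rate across the EnterpriseOps domains) can be illustrated with a minimal Python sketch. It assumes an unweighted mean over the eight domains listed under Metrics; the benchmark's actual aggregation may weight domains differently. The function name `rank_by_average`, the subjects `agent_a` and `agent_b`, and all per-domain numbers are hypothetical placeholders, not benchmark results.

```python
from statistics import mean

# Domain names taken from the Metrics list on this page.
DOMAINS = ["Teams", "CSM", "Email", "ITSM", "Calendar", "HR", "Drive", "Hybrid"]


def rank_by_average(per_domain):
    """Rank subjects by the unweighted mean of their per-domain success rates.

    per_domain maps subject name -> {domain name -> success rate in percent}.
    Returns a list of (rank, subject, average) tuples, best first.
    Assumption: every domain counts equally toward the average.
    """
    averaged = []
    for subject, scores in per_domain.items():
        missing = [d for d in DOMAINS if d not in scores]
        if missing:
            raise ValueError(f"{subject} is missing domains: {missing}")
        averaged.append((subject, round(mean(scores[d] for d in DOMAINS), 1)))

    # Highest average first, matching the leaderboard ordering.
    averaged.sort(key=lambda item: item[1], reverse=True)
    return [(rank, subject, avg)
            for rank, (subject, avg) in enumerate(averaged, start=1)]


# Hypothetical per-domain scores, for illustration only.
example = {
    "agent_a": dict(zip(DOMAINS, [50.0, 40.0, 30.0, 20.0, 50.0, 40.0, 30.0, 20.0])),
    "agent_b": dict(zip(DOMAINS, [60.0, 60.0, 40.0, 40.0, 60.0, 60.0, 40.0, 40.0])),
}

for rank, subject, avg in rank_by_average(example):
    print(f"{rank} {subject} {avg}%")
```

Raising on a missing domain, rather than averaging over whatever is present, keeps subjects comparable: an average over fewer domains would not be ranked against an average over all eight.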