WildClawBench

An end-to-end AI agent benchmark of 60 original tasks run in a live OpenClaw environment. The tasks span productivity, code intelligence, social interaction, search, creative synthesis, and safety alignment workflows.

Rows: 14
Primary metric: overall_score
Sampled: 2026-05-06

Metadata

Metrics

Overall Score, Avg Time (lower is better), Avg Cost (lower is better)
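The direction of each metric matters when ranking: Overall Score is sorted descending, while Avg Time and Avg Cost are sorted ascending. The following is a minimal sketch of that rule. The keys `avg_time` and `avg_cost`, the `rank_by` helper, and the placeholder rows are all assumptions made for illustration; only `overall_score` appears as a metric key on this page, and the values below are not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    key: str
    higher_is_better: bool


# Metric directions as stated above. The avg_time and avg_cost keys are
# hypothetical names; only overall_score is named on the page.
METRICS = {
    "overall_score": Metric("overall_score", higher_is_better=True),
    "avg_time": Metric("avg_time", higher_is_better=False),
    "avg_cost": Metric("avg_cost", higher_is_better=False),
}


def rank_by(rows, metric_key):
    """Return rows ordered best-first for the given metric."""
    metric = METRICS[metric_key]
    return sorted(rows, key=lambda r: r[metric_key], reverse=metric.higher_is_better)


# Placeholder rows with made-up values, NOT real leaderboard data.
rows = [
    {"subject": "model-a", "overall_score": 40.0, "avg_time": 300.0, "avg_cost": 1.5},
    {"subject": "model-b", "overall_score": 50.0, "avg_time": 450.0, "avg_cost": 0.9},
    {"subject": "model-c", "overall_score": 30.0, "avg_time": 120.0, "avg_cost": 2.1},
]

for key in METRICS:
    print(key, [r["subject"] for r in rank_by(rows, key)])
```

Keeping the direction next to the metric definition avoids scattering `reverse=True` flags through the ranking code.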

Latest Results

Rows are parsed from the WildClawBench README leaderboard table. Source model and organization names are preserved.
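As a rough illustration of how rows might be extracted from a README leaderboard, the sketch below parses a pipe-delimited Markdown table into dictionaries without altering the cell text. The actual import pipeline is not described here, so the `parse_leaderboard` function and the column names in `SAMPLE` are assumptions; the two sample scores are taken from the results on this page.

```python
def split_cells(line: str) -> list[str]:
    """Split one pipe-delimited table line into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_leaderboard(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe table found in the text into a list of row dicts."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_cells(lines[0])
    rows = []
    for line in lines[1:]:
        values = split_cells(line)
        # Skip the separator row, e.g. | --- | :---: |
        if all(v and set(v) <= set("-:") for v in values):
            continue
        # Skip malformed rows rather than guessing at missing cells.
        if len(values) != len(header):
            continue
        rows.append(dict(zip(header, values)))
    return rows


# Hypothetical README excerpt; the column layout is an assumption.
SAMPLE = """
| Model | Overall Score |
| --- | --- |
| Claude Opus 4.6 | 51.60 |
| GPT-5.4 | 50.30 |
"""

parsed = parse_leaderboard(SAMPLE)
print(parsed)
```

Cell values are kept as strings so that source names pass through unchanged; numeric conversion, for example `float(row["Overall Score"])`, can happen in a later step.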

| Rank | Subject | Overall Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | 51.60 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06 |
| 2 | GPT-5.4 | 50.30 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-06 |
| 3 | GLM 5 | 42.60 | GLM 5 (z-ai-glm-5) | Imported | 2026-05-06 |
| 4 | Gemini 3.1 Pro | 40.80 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-06 |
| 5 | MiMo V2 Pro | 40.20 | MiMo-V2-Pro (xiaomi-mimo-v2-pro) | Imported | 2026-05-06 |
| 6 | Qwen3.5 397B | 34.50 | | Imported | 2026-05-06 |
| 7 | DeepSeek V3.2 | 34.00 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-06 |
| 8 | GLM 5 Turbo | 33.90 | GLM 5 Turbo (z-ai-glm-5-turbo) | Imported | 2026-05-06 |
| 9 | MiniMax M2.7 | 33.80 | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-06 |
| 10 | Kimi K2.5 | 30.80 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-06 |
| 11 | MiMo V2 Flash | 30.80 | MiMo-V2-Flash (xiaomi-mimo-v2-flash) | Imported | 2026-05-06 |
| 12 | MiniMax M2.5 | 27.10 | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-06 |
| 13 | Step 3.5 Flash | 26.70 | Step 3.5 Flash (stepfun-step-3.5-flash) | Imported | 2026-05-06 |
| 14 | Grok 4.20 Beta | 19.30 | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-06 |