WildClawBench
End-to-end AI agent benchmark with 60 original tasks in a live OpenClaw environment spanning productivity, code intelligence, social interaction, search, creative synthesis, and safety alignment workflows.
14 rows
Primary metric: overall_score
Sampled: 2026-05-06
Metadata
Metrics
Overall Score, Avg Time (lower is better), Avg Cost (lower is better)
| Rank | Subject | Overall Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 51.60 | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-06 |
| 2 | GPT-5.4 | 50.30 | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-06 |
| 3 | GLM 5 | 42.60 | GLM 5 z-ai-glm-5 | Imported | 2026-05-06 |
| 4 | Gemini 3.1 Pro | 40.80 | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-06 |
| 5 | MiMo V2 Pro | 40.20 | MiMo-V2-Pro xiaomi-mimo-v2-pro | Imported | 2026-05-06 |
| 6 | Qwen3.5 397B | 34.50 | — | Imported | 2026-05-06 |
| 7 | DeepSeek V3.2 | 34.00 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 8 | GLM 5 Turbo | 33.90 | GLM 5 Turbo z-ai-glm-5-turbo | Imported | 2026-05-06 |
| 9 | MiniMax M2.7 | 33.80 | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-06 |
| 10 | Kimi K2.5 | 30.80 | MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5 | Imported | 2026-05-06 |
| 11 | MiMo V2 Flash | 30.80 | MiMo-V2-Flash xiaomi-mimo-v2-flash | Imported | 2026-05-06 |
| 12 | MiniMax M2.5 | 27.10 | MiniMax M2.5 minimax-minimax-m2.5 | Imported | 2026-05-06 |
| 13 | Step 3.5 Flash | 26.70 | Step 3.5 Flash stepfun-step-3.5-flash | Imported | 2026-05-06 |
| 14 | Grok 4.20 Beta | 19.30 | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-06 |
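
The ordering in the table is consistent with sorting by Overall Score, highest first. The sketch below reproduces that ordering for a handful of rows copied from the table. The helper name `rank_by_score` is hypothetical, and the tie handling is an assumption: the page does not say how equal scores (such as the two 30.80 entries) are ordered, so this version simply keeps tied rows in their input order.

```python
# Subset of (subject, overall_score) pairs copied from the table above.
# Deliberately shuffled so the sort has something to do.
ROWS = [
    ("GPT-5.4", 50.30),
    ("Kimi K2.5", 30.80),
    ("Claude Opus 4.6", 51.60),
    ("MiMo V2 Flash", 30.80),
    ("GLM 5", 42.60),
]


def rank_by_score(rows):
    """Return (rank, subject, score) tuples, highest score first.

    Assumption: ties are not specially handled. Python's sorted() is
    stable, so rows with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


RANKED = rank_by_score(ROWS)

for rank, name, score in RANKED:
    # Ranks here are positions within this five-row subset,
    # not the ranks shown in the full table.
    print(f"{rank:>2}  {name:<16} {score:.2f}")
```

The ranks printed are relative to the subset only. Avg Time and Avg Cost are marked as lower-is-better, so a sort on either of those would drop `reverse=True`.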