Claw-Eval-Live
Quarterly refreshed enterprise-workflow benchmark grounded in live ClawHub marketplace signals and scored with deterministic checks plus structured judging.
13 rows
Primary metric: pass_rate
Sampled: 2026-05-27
Metadata
Metrics
Pass Rate, Overall Completion Score, Tokens / Task (lower is better), Turns / Task (lower is better), Seconds / Task (lower is better)
| Rank | Subject | Pass Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 66.7 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-27 |
| 2 | GPT-5.4 | 63.8 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-27 |
| 3 | Claude Sonnet 4.6 | 61.9 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-27 |
| 4 | GLM-5 | 61.9 | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-27 |
| 5 | MiniMax M2.7 | 54.3 | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-27 |
| 6 | MiMo V2 Pro | 53.3 | MiMo-V2-Pro (`xiaomi-mimo-v2-pro`) | Imported | 2026-05-27 |
| 7 | Kimi K2.5 | 53.3 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-27 |
| 8 | Gemini 3.1 Pro | 53.3 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-27 |
| 9 | DeepSeek V3.2 | 51.4 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-27 |
| 10 | Qwen 3.6 Plus | 50.5 | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Imported | 2026-05-27 |
| 11 | MiniMax M2.5 | 50.5 | MiniMax M2.5 (`minimax-minimax-m2.5`) | Imported | 2026-05-27 |
| 12 | Qwen 3.5 397B | 49.5 | — | Imported | 2026-05-27 |
| 13 | Doubao Seed 2.0 | 43.8 | — | Imported | 2026-05-27 |
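
Several subjects in the table share a Pass Rate (for example 61.9 and 53.3) yet occupy consecutive ranks. The page does not state how ties are broken, so the following is only a minimal sketch of one alternative: standard competition ranking, where tied scores share a rank and the next rank is skipped. The function name `competition_rank` and the sample rows (the top five from the table) are illustrative assumptions, not part of the benchmark's tooling.

```python
from typing import List, Tuple


def competition_rank(scores: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign 1-based ranks by descending score; tied scores share a rank.

    Hypothetical helper: the leaderboard's real tie-breaking rule is not
    documented on the page, so this is an assumption for illustration.
    """
    # sorted() is stable, so tied subjects keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


# Top five rows copied from the table above.
rows = [
    ("Claude Opus 4.6", 66.7),
    ("GPT-5.4", 63.8),
    ("Claude Sonnet 4.6", 61.9),
    ("GLM-5", 61.9),
    ("MiniMax M2.7", 54.3),
]

for rank, name, score in competition_rank(rows):
    print(rank, name, score)
```

Under this scheme Claude Sonnet 4.6 and GLM-5 would both be ranked 3 and MiniMax M2.7 would be ranked 5, whereas the table as published numbers them 3, 4, and 5.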