Claw-Eval
Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environments, and complete end-to-end tasks autonomously.
13 rows
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Score, Normalized Score
Showing 2 latest source slices.
Slice sampled 2026-05-28 (scores shown as percentages):

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 Max | 70.4% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Self-reported | 2026-05-28 |
| 2 | Qwen3.7 Max | 65.2% | Qwen3.7 Max (`qwen-qwen3.7-max`) | Self-reported | 2026-05-28 |
| 3 | GLM-5.1 Thinking | 62.7% | GLM 5.1 (`z-ai-glm-5.1`) | Self-reported | 2026-05-28 |
| 4 | Kimi K2.6 Thinking | 61.5% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Self-reported | 2026-05-28 |
| 5 | DeepSeek V4 Pro Max | 58.4% | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Self-reported | 2026-05-28 |
| 6 | Qwen3.6 Plus | 57.1% | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Self-reported | 2026-05-28 |

Slice sampled 2026-05-06 (scores shown as decimals):

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 | 0.81 | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Self-reported | 2026-05-06 |
| 2 | GLM-5V-Turbo | 0.75 | GLM 5V Turbo (`z-ai-glm-5v-turbo`) | Self-reported | 2026-05-06 |
| 3 | MiMo-V2-Pro | 0.61 | MiMo-V2-Pro (`xiaomi-mimo-v2-pro`) | Self-reported | 2026-05-06 |
| 4 | Qwen3.6-27B | 0.61 | Qwen3.6 27B (`qwen-qwen3.6-27b`) | Self-reported | 2026-05-06 |
| 5 | Qwen3.6 Plus | 0.59 | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Self-reported | 2026-05-06 |
| 6 | MiMo-V2-Omni | 0.55 | MiMo-V2-Omni (`xiaomi-mimo-v2-omni`) | Self-reported | 2026-05-06 |
| 7 | Qwen3.6-35B-A3B | 0.50 | Qwen3.6 35B A3B (`qwen-qwen3.6-35b-a3b`) | Self-reported | 2026-05-06 |
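
The rows above come from two source slices: ranks restart with each sampled date, and scores are written as percentages for 2026-05-28 but as decimals for 2026-05-06. The following is a minimal Python sketch of one way to read such rows, grouping them by sampled date and ranking within each group. The names `parse_score` and `rank_by_slice` are illustrative and not part of any Claw-Eval tooling. The sketch deliberately does not convert between the two notations, since the page does not state whether they measure the same quantity.

```python
from collections import defaultdict

# Rows copied from the leaderboard: (subject, score as displayed, sampled date).
ROWS = [
    ("Claude Opus 4.6 Max", "70.4%", "2026-05-28"),
    ("Qwen3.7 Max", "65.2%", "2026-05-28"),
    ("GLM-5.1 Thinking", "62.7%", "2026-05-28"),
    ("Kimi K2.6 Thinking", "61.5%", "2026-05-28"),
    ("DeepSeek V4 Pro Max", "58.4%", "2026-05-28"),
    ("Qwen3.6 Plus", "57.1%", "2026-05-28"),
    ("Kimi K2.6", "0.81", "2026-05-06"),
    ("GLM-5V-Turbo", "0.75", "2026-05-06"),
    ("MiMo-V2-Pro", "0.61", "2026-05-06"),
    ("Qwen3.6-27B", "0.61", "2026-05-06"),
    ("Qwen3.6 Plus", "0.59", "2026-05-06"),
    ("MiMo-V2-Omni", "0.55", "2026-05-06"),
    ("Qwen3.6-35B-A3B", "0.50", "2026-05-06"),
]


def parse_score(text):
    """Return (value, notation) without converting between notations."""
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]), "percent"
    return float(text), "decimal"


def rank_by_slice(rows):
    """Group rows by sampled date and rank each group by descending score."""
    slices = defaultdict(list)
    for subject, score, sampled in rows:
        value, notation = parse_score(score)
        slices[sampled].append((subject, value, notation))

    ranked = {}
    for sampled, entries in slices.items():
        # sorted() is stable, so tied scores keep their original order.
        ordered = sorted(entries, key=lambda e: e[1], reverse=True)
        ranked[sampled] = [
            (position + 1, subject, value, notation)
            for position, (subject, value, notation) in enumerate(ordered)
        ]
    return ranked


if __name__ == "__main__":
    for sampled, entries in sorted(rank_by_slice(ROWS).items(), reverse=True):
        print(sampled)
        for rank, subject, value, notation in entries:
            print(f"  {rank}. {subject}: {value} ({notation})")
```

Keeping the slices separate avoids comparing, say, 57.1% against 0.59 for Qwen3.6 Plus, which appears in both slices under different notations.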