Claw-Eval-Live

Quarterly refreshed enterprise-workflow benchmark grounded in live ClawHub marketplace signals and scored with deterministic checks plus structured judging.

Rows: 13
Primary metric: pass_rate
Sampled: 2026-05-27

Metadata

Metrics

Pass Rate, Overall Completion Score, Tokens / Task (lower is better), Turns / Task (lower is better), Seconds / Task (lower is better)

Latest Results

Rows are parsed from the models array in the Claw-Eval-Live public page's JavaScript. The source ranks by pass rate, with overall completion score as the tiebreaker.
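The ranking rule described above (pass rate first, overall completion score to break ties) can be sketched as a two-key sort. This is only an illustration, not the source's actual code: the `Row` type, its field names, and the sample values are hypothetical placeholders, since this page does not list completion scores.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    # Hypothetical record shape; the real models array may use other field names.
    subject: str
    pass_rate: float
    completion: float  # overall completion score, used only to break ties


def rank_rows(rows: List[Row]) -> List[Tuple[int, str]]:
    """Return (rank, subject) pairs, best first.

    Sorts by pass rate descending, then by completion score descending.
    """
    ordered = sorted(rows, key=lambda r: (-r.pass_rate, -r.completion))
    return [(i + 1, r.subject) for i, r in enumerate(ordered)]


# Placeholder data (not real results): model-a and model-c tie on pass rate.
sample = [
    Row("model-a", 50.0, 70.1),
    Row("model-b", 60.0, 65.0),
    Row("model-c", 50.0, 72.3),
]
ranking = rank_rows(sample)
```

Under this rule, two entries with the same pass rate (as with the two 61.9 rows in the results) are ordered by their completion score, so equal pass rates still receive distinct ranks.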

| Rank | Subject | Pass Rate (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 66.7 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-27 |
| 2 | GPT-5.4 | 63.8 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-27 |
| 3 | Claude Sonnet 4.6 | 61.9 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-27 |
| 4 | GLM-5 | 61.9 | GLM 5 (z-ai-glm-5) | Imported | 2026-05-27 |
| 5 | MiniMax M2.7 | 54.3 | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-27 |
| 6 | MiMo V2 Pro | 53.3 | MiMo-V2-Pro (xiaomi-mimo-v2-pro) | Imported | 2026-05-27 |
| 7 | Kimi K2.5 | 53.3 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-27 |
| 8 | Gemini 3.1 Pro | 53.3 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-27 |
| 9 | DeepSeek V3.2 | 51.4 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-27 |
| 10 | Qwen 3.6 Plus | 50.5 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-27 |
| 11 | MiniMax M2.5 | 50.5 | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-27 |
| 12 | Qwen 3.5 397B | 49.5 | - | Imported | 2026-05-27 |
| 13 | Doubao Seed 2.0 | 43.8 | - | Imported | 2026-05-27 |