ClawProBench

OpenClaw agent benchmark measuring model performance on reasoning, planning, tool use, reliability, efficiency, and safety across repeated runs.

Rows: 61
Primary metric: final_score
Sampled: 2026-05-06

Metadata

Metrics

Final Score, Pass^3, Pass@3, Avg Score, Capability, Efficiency, Planning, Safety, Tool Use, Constraints, Error Recovery, Synthesis, Avg Runtime (lower is better), Total Tokens (lower is better), Cost (lower is better)
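The page lists Pass^3 and Pass@3 without defining them. The sketch below follows a common convention and is an assumption, not ClawProBench's documented scoring: Pass@3 counts a task as solved if at least one of its three runs passes, while Pass^3 requires all three runs to pass. The function names and the `runs` layout (task id mapped to a list of pass/fail outcomes) are hypothetical.

```python
from typing import Dict, List


def pass_at_k(runs: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks where at least one of the first k runs passed (assumed Pass@k)."""
    _check(runs, k)
    return sum(any(outcomes[:k]) for outcomes in runs.values()) / len(runs)


def pass_hat_k(runs: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks where all of the first k runs passed (assumed Pass^k)."""
    _check(runs, k)
    return sum(all(outcomes[:k]) for outcomes in runs.values()) / len(runs)


def _check(runs: Dict[str, List[bool]], k: int) -> None:
    # Every task needs at least k recorded runs, otherwise the metrics are undefined here.
    if not runs:
        raise ValueError("no tasks given")
    for task, outcomes in runs.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")


# Hypothetical outcomes for four tasks, three runs each.
example_runs = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}

at3 = pass_at_k(example_runs)    # three of four tasks pass at least once
hat3 = pass_hat_k(example_runs)  # only task_a passes every run
```

Under these assumed definitions Pass^3 can never exceed Pass@3, and the gap between them is one rough indicator of how consistent a model is across repeated runs.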

Latest Results

Rows are ranked by Final Score, highest first.

| Rank | Subject | Final Score | Model Match | Provenance | Sampled |
|------|---------|-------------|-------------|------------|---------|
| 1 | gpt-5.5-xhigh | 67.90 | | Imported | 2026-05-06 |
| 2 | deepseek-v4-pro | 64.38 | | Imported | 2026-05-06 |
| 3 | qwen3.5-plus | 64.19 | | Imported | 2026-05-06 |
| 4 | qwen3.5-397b-a17b | 64.18 | | Imported | 2026-05-06 |
| 5 | mimo-v2.5-pro | 63.30 | | Imported | 2026-05-06 |
| 6 | GLM-5.1 | 62.93 | | Imported | 2026-05-06 |
| 7 | doubao-seed-2.0-code | 62.36 | | Imported | 2026-05-06 |
| 8 | GLM-5-Turbo | 61.92 | | Imported | 2026-05-06 |
| 9 | deepseek-v4-flash | 61.47 | | Imported | 2026-05-06 |
| 10 | doubao-seed-2.0-pro | 61.07 | | Imported | 2026-05-06 |
| 11 | Claude Sonnet 4.6 | 60.50 | | Imported | 2026-05-06 |
| 12 | doubao-seed-2.0-lite | 60.40 | | Imported | 2026-05-06 |
| 13 | mimo-v2.5 | 60.39 | | Imported | 2026-05-06 |
| 14 | qwen3.6-plus | 60.20 | | Imported | 2026-05-06 |
| 15 | DeepSeek-V3.2 | 60.13 | | Imported | 2026-05-06 |
| 16 | DeepSeek-V3.2 | 60.12 | | Imported | 2026-05-06 |
| 17 | huanyuan-3.0-preview | 59.39 | | Imported | 2026-05-06 |
| 18 | kimi-k2.6 | 59.31 | | Imported | 2026-05-06 |
| 19 | doubao-seed-code | 59.22 | | Imported | 2026-05-06 |
| 20 | qwen3.6-plus | 59.05 | | Imported | 2026-05-06 |
| 21 | LongCat-2.0-Preview | 58.80 | | Imported | 2026-05-06 |
| 22 | qwen3.6-27b | 58.74 | | Imported | 2026-05-06 |
| 23 | kimi-k2.5 | 58.49 | | Imported | 2026-05-06 |
| 24 | DeepSeekV3.2 | 57.94 | | Imported | 2026-05-06 |
| 25 | mimo-v2-pro | 57.92 | | Imported | 2026-05-06 |
| 26 | mimo-v2-omni | 57.65 | | Imported | 2026-05-06 |
| 27 | LongCat-Flash-Thinking-2601 | 57.48 | | Imported | 2026-05-06 |
| 28 | Ling-2.6-1T | 57.40 | | Imported | 2026-05-06 |
| 29 | qwen3.6-max-preview | 57.40 | | Imported | 2026-05-06 |
| 30 | kimi-k2.6-code-preview | 57.14 | | Imported | 2026-05-06 |
| 31 | GLM-5 | 57.05 | | Imported | 2026-05-06 |
| 32 | qwen3.6-35b-a3b | 56.94 | | Imported | 2026-05-06 |
| 33 | gpt-5.4 | 56.73 | | Imported | 2026-05-06 |
| 34 | qwen3.6-flash | 56.55 | | Imported | 2026-05-06 |
| 35 | GLM-4.6 | 56.29 | | Imported | 2026-05-06 |
| 36 | qwen3-max-2026-01-23 | 55.76 | | Imported | 2026-05-06 |
| 37 | kat-coder-pro-v2 | 54.74 | | Imported | 2026-05-06 |
| 38 | GLM-4.7 | 54.58 | | Imported | 2026-05-06 |
| 39 | gemini-3.1-pro-preview | 53.95 | | Imported | 2026-05-06 |
| 40 | hunyuan-2.0-thinking | 52.69 | | Imported | 2026-05-06 |
| 41 | MiniMax-M2.5 | 51.79 | | Imported | 2026-05-06 |
| 42 | gemma-4-31b-it | 51.59 | | Imported | 2026-05-06 |
| 43 | Ling-2.5-1T | 51.19 | | Imported | 2026-05-06 |
| 44 | DeepSeek-R1 | 50.23 | | Imported | 2026-05-06 |
| 45 | MiniMax-M2.7 | 49.53 | | Imported | 2026-05-06 |
| 46 | kimi-for-coding-k2.6 | 49.04 | | Imported | 2026-05-06 |
| 47 | gemini-3-flash-preview | 48.99 | | Imported | 2026-05-06 |
| 48 | MiniMax-M2.1 | 48.08 | | Imported | 2026-05-06 |
| 49 | Kimi-K2-Thinking | 47.83 | | Imported | 2026-05-06 |
| 50 | hunyuan-2.0-instruct | 46.94 | | Imported | 2026-05-06 |
| 51 | qwen3-coder-next | 46.84 | | Imported | 2026-05-06 |
| 52 | mistral-small-2603 | 45.26 | | Imported | 2026-05-06 |
| 53 | grok-4.20 | 43.04 | | Imported | 2026-05-06 |
| 54 | kimi-for-coding-k2.5 | 42.72 | | Imported | 2026-05-06 |
| 55 | step-3.5-flash-2603 | 42.59 | | Imported | 2026-05-06 |
| 56 | step-3.5-flash | 41.75 | | Imported | 2026-05-06 |
| 57 | Spark X2 | 41.44 | | Imported | 2026-05-06 |
| 58 | step-3.5-flash | 38.74 | | Imported | 2026-05-06 |
| 59 | hunyuan-t1 | 34.74 | | Imported | 2026-05-06 |
| 60 | ERNIE-4.5-Turbo | 33.68 | | Imported | 2026-05-06 |
| 61 | Ling-2.6-Flash | 27.04 | | Imported | 2026-05-06 |