ResearchClawBench

A benchmark for AI coding and research agents. Systems must carry out scientific research end to end, from raw data to a publication-quality report, on tasks grounded in human-authored target studies.

Rows: 22
Primary metric: average_score
Sampled: 2026-05-27

Metadata

Metrics

Average Score, Task Coverage, Average Duration (lower is better), Average Cost (lower is better)
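Because two of these metrics are marked "lower is better", the sort direction depends on which metric is used for ranking. The following is a minimal sketch of that idea, not the leaderboard's actual code. Only `average_score` is a key shown on this page; the other three key names, the assumed direction for task coverage, and the `rank_rows` helper are hypothetical. The sample scores are taken from the results on this page.

```python
# Sort direction per metric. Only "average_score" is a key shown on the page;
# the other three keys are hypothetical names for the listed metrics.
HIGHER_IS_BETTER = {
    "average_score": True,
    "task_coverage": True,      # assumed direction; the page does not state it
    "average_duration": False,  # marked "lower is better"
    "average_cost": False,      # marked "lower is better"
}


def rank_rows(rows, metric="average_score"):
    """Return rows ordered best-first for the given metric (hypothetical helper)."""
    return sorted(rows, key=lambda r: r[metric], reverse=HIGHER_IS_BETTER[metric])


rows = [
    {"subject": "Codex CLI", "average_score": 18.42},
    {"subject": "Claude Code", "average_score": 21.53},
    {"subject": "Nanobot", "average_score": 12.81},
]
best_first = [r["subject"] for r in rank_rows(rows)]
print(best_first)  # ['Claude Code', 'Codex CLI', 'Nanobot']
```

Passing `metric="average_cost"` or `metric="average_duration"` to the same helper would put the smallest value first.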

Latest Results

Rows are imported from the public ResearchClawBench GitHub Pages static leaderboard JSON.

| Rank | Subject | Average Score | Model Match | Provenance | Sampled |
|------|---------|---------------|-------------|------------|---------|
| 1 | Claude Code | 21.53 | | Imported | 2026-05-27 |
| 2 | ResearchHarness (Claude-Opus-4.7) | 20.74 | | Imported | 2026-05-27 |
| 3 | ResearchHarness (Claude-Opus-4.6) | 19.87 | | Imported | 2026-05-27 |
| 4 | Codex CLI | 18.42 | | Imported | 2026-05-27 |
| 5 | ResearchHarness (GLM-5.1) | 18.19 | | Imported | 2026-05-27 |
| 6 | ResearchHarness (Qwen3.6-Plus) | 18.00 | | Imported | 2026-05-27 |
| 7 | ResearchHarness (Kimi-K2.6) | 18.00 | | Imported | 2026-05-27 |
| 8 | ResearchHarness (DeepSeek-V4-Pro) | 17.12 | | Imported | 2026-05-27 |
| 9 | ResearchHarness (GPT-5.5) | 17.00 | | Imported | 2026-05-27 |
| 10 | ResearchHarness (MiMo-V2.5) | 16.91 | | Imported | 2026-05-27 |
| 11 | OpenClaw | 16.64 | | Imported | 2026-05-27 |
| 12 | ResearchClaw | 16.28 | | Imported | 2026-05-27 |
| 13 | EvoScientist | 15.47 | | Imported | 2026-05-27 |
| 14 | ResearchHarness (MiMo-V2-Pro) | 15.34 | | Imported | 2026-05-27 |
| 15 | ResearchHarness (GPT-5.4) | 15.28 | | Imported | 2026-05-27 |
| 16 | ResearchHarness (Qwen3.5-397B-A17B) | 14.23 | | Imported | 2026-05-27 |
| 17 | ResearchHarness (Kimi-K2.5) | 13.96 | | Imported | 2026-05-27 |
| 18 | ARIS Codex | 13.58 | | Imported | 2026-05-27 |
| 19 | ResearchHarness (Grok-4.1) | 13.50 | | Imported | 2026-05-27 |
| 20 | ResearchHarness (Gemini-3.1-Pro) | 13.28 | | Imported | 2026-05-27 |
| 21 | Nanobot | 12.81 | | Imported | 2026-05-27 |
| 22 | ResearchHarness (Grok-4.3) | 12.41 | | Imported | 2026-05-27 |
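Two entries above show the same displayed score (18.00) but occupy ranks 6 and 7. The sketch below shows one way to derive standings from imported rows while giving tied scores a shared rank. It is an illustration under stated assumptions: the payload shape, the field names other than `average_score`, and the `standings` helper are hypothetical, and the real leaderboard JSON may be structured differently. The displayed scores are rounded to two decimals, so the underlying values may not be exactly tied.

```python
import json

# Hypothetical payload shape; scores are copied from the table on this page.
payload = """
[
  {"subject": "ResearchHarness (Kimi-K2.6)", "average_score": 18.00},
  {"subject": "ResearchHarness (GLM-5.1)", "average_score": 18.19},
  {"subject": "ResearchHarness (Qwen3.6-Plus)", "average_score": 18.00}
]
"""


def standings(rows):
    """Order rows by score, highest first; equal scores share a rank."""
    ordered = sorted(rows, key=lambda r: r["average_score"], reverse=True)
    out = []
    for i, row in enumerate(ordered):
        if i > 0 and row["average_score"] == ordered[i - 1]["average_score"]:
            rank = out[-1][0]  # tie: reuse the previous row's rank
        else:
            rank = i + 1       # otherwise rank is the 1-based position
        out.append((rank, row["subject"], row["average_score"]))
    return out


table = standings(json.loads(payload))
for rank, subject, score in table:
    print(rank, subject, f"{score:.2f}")
```

Within this three-row sample, the two 18.00 entries both receive rank 2, whereas the table on this page assigns them consecutive ranks.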