ResearchClawBench
Benchmark for AI coding and research agents that asks systems to conduct scientific research from raw data to publication-quality reports, with tasks grounded in human-authored target studies.
22 rows
Primary metric: average_score
Sampled: 2026-05-27
Metadata
Metrics: Average Score, Task Coverage, Average Duration (lower is better), Average Cost (lower is better)
| Rank | Subject | Average Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Code | 21.53 | — | Imported | 2026-05-27 |
| 2 | ResearchHarness (Claude-Opus-4.7) | 20.74 | — | Imported | 2026-05-27 |
| 3 | ResearchHarness (Claude-Opus-4.6) | 19.87 | — | Imported | 2026-05-27 |
| 4 | Codex CLI | 18.42 | — | Imported | 2026-05-27 |
| 5 | ResearchHarness (GLM-5.1) | 18.19 | — | Imported | 2026-05-27 |
| 6 | ResearchHarness (Qwen3.6-Plus) | 18.00 | — | Imported | 2026-05-27 |
| 7 | ResearchHarness (Kimi-K2.6) | 18.00 | — | Imported | 2026-05-27 |
| 8 | ResearchHarness (DeepSeek-V4-Pro) | 17.12 | — | Imported | 2026-05-27 |
| 9 | ResearchHarness (GPT-5.5) | 17.00 | — | Imported | 2026-05-27 |
| 10 | ResearchHarness (MiMo-V2.5) | 16.91 | — | Imported | 2026-05-27 |
| 11 | OpenClaw | 16.64 | — | Imported | 2026-05-27 |
| 12 | ResearchClaw | 16.28 | — | Imported | 2026-05-27 |
| 13 | EvoScientist | 15.47 | — | Imported | 2026-05-27 |
| 14 | ResearchHarness (MiMo-V2-Pro) | 15.34 | — | Imported | 2026-05-27 |
| 15 | ResearchHarness (GPT-5.4) | 15.28 | — | Imported | 2026-05-27 |
| 16 | ResearchHarness (Qwen3.5-397B-A17B) | 14.23 | — | Imported | 2026-05-27 |
| 17 | ResearchHarness (Kimi-K2.5) | 13.96 | — | Imported | 2026-05-27 |
| 18 | ARIS Codex | 13.58 | — | Imported | 2026-05-27 |
| 19 | ResearchHarness (Grok-4.1) | 13.50 | — | Imported | 2026-05-27 |
| 20 | ResearchHarness (Gemini-3.1-Pro) | 13.28 | — | Imported | 2026-05-27 |
| 21 | Nanobot | 12.81 | — | Imported | 2026-05-27 |
| 22 | ResearchHarness (Grok-4.3) | 12.41 | — | Imported | 2026-05-27 |