Workspace-Bench
Workspace-agent benchmark over file-heavy tasks involving documents, spreadsheets, presentations, code, and multi-file dependencies.
Rows: 45
Primary metric: rubric_pass_rate
Sampled: 2026-05-27
Metadata
Metrics: Rubric Pass Rate, Easy Rubrics Accuracy, Medium Rubrics Accuracy, Hard Rubrics Accuracy, Pass@50, Pass@60, Pass@80
| Rank | Subject | Rubric Pass Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | OpenClaw + Opus-4.7 | 66.7% | — | Verified | 2026-05-27 |
| 2 | ClaudeCode + Opus-4.7 | 64.7% | — | Verified | 2026-05-27 |
| 3 | Hermes + Opus-4.7 | 64.5% | — | Verified | 2026-05-27 |
| 4 | DeepAgent + GLM-5.1 | 61% | — | Verified | 2026-05-27 |
| 5 | Hermes + GLM-5.1 | 57.7% | — | Verified | 2026-05-27 |
| 6 | OpenClaw + GLM-5.1 | 57.5% | — | Verified | 2026-05-27 |
| 7 | OpenClaw + Qwen-3.6-Plus | 55.6% | — | Verified | 2026-05-27 |
| 8 | ClaudeCode + MiniMax-M2.7 | 54.6% | — | Verified | 2026-05-27 |
| 9 | DeepAgent + Opus-4.7 | 54.4% | — | Verified | 2026-05-27 |
| 10 | Codex + GLM-5.1 | 53.1% | — | Verified | 2026-05-27 |
| 11 | ClaudeCode + Seed-2.0-Lite | 53% | — | Verified | 2026-05-27 |
| 12 | ClaudeCode + GLM-5.1 | 52.6% | — | Verified | 2026-05-27 |
| 13 | Hermes + MiniMax-M2.7 | 52.6% | — | Verified | 2026-05-27 |
| 14 | ClaudeCode + GPT-5.4 | 51.8% | — | Verified | 2026-05-27 |
| 15 | Hermes + Qwen-3.6-Plus | 50.9% | — | Verified | 2026-05-27 |
| 16 | Hermes + Kimi-2.5 | 49.1% | — | Verified | 2026-05-27 |
| 17 | ClaudeCode + Kimi-2.5 | 48.3% | — | Verified | 2026-05-27 |
| 18 | Codex + GPT-5.4 | 47.7% | — | Verified | 2026-05-27 |
| 19 | Codex + Qwen-3.6-Plus | 47.3% | — | Verified | 2026-05-27 |
| 20 | OpenClaw + GPT-5.4 | 47.1% | — | Verified | 2026-05-27 |
| 21 | Codex + Kimi-2.5 | 46.9% | — | Verified | 2026-05-27 |
| 22 | OpenClaw + Seed-2.0-Lite | 46.6% | — | Verified | 2026-05-27 |
| 23 | Hermes + Seed-2.0-Lite | 45.9% | — | Verified | 2026-05-27 |
| 24 | DeepAgent + MiniMax-M2.7 | 45% | — | Verified | 2026-05-27 |
| 25 | OpenClaw + Kimi-2.5 | 44.5% | — | Verified | 2026-05-27 |
| 26 | Hermes + GPT-5.4 | 44.3% | — | Verified | 2026-05-27 |
| 27 | OpenClaw + MiniMax-M2.7 | 44.1% | — | Verified | 2026-05-27 |
| 28 | Codex + MiniMax-M2.7 | 42.7% | — | Verified | 2026-05-27 |
| 29 | ClaudeCode + Seed-2.0-Code | 42.3% | — | Verified | 2026-05-27 |
| 30 | DeepAgent + Kimi-2.5 | 41.6% | — | Verified | 2026-05-27 |
| 31 | OpenClaw + Seed-2.0-Code | 40.1% | — | Verified | 2026-05-27 |
| 32 | DeepAgent + Qwen-3.6-Plus | 39.4% | — | Verified | 2026-05-27 |
| 33 | Hermes + Seed-2.0-Code | 38.6% | — | Verified | 2026-05-27 |
| 34 | ClaudeCode + Gemini-3.1-Pro | 37.5% | — | Verified | 2026-05-27 |
| 35 | DeepAgent + Gemini-3.1-Pro | 37.2% | — | Verified | 2026-05-27 |
| 36 | Codex + Grok-4.3 | 36.9% | — | Verified | 2026-05-27 |
| 37 | DeepAgent + GPT-5.4 | 36.2% | — | Verified | 2026-05-27 |
| 38 | Hermes + Grok-4.3 | 36.2% | — | Verified | 2026-05-27 |
| 39 | DeepAgent + Seed-2.0-Lite | 36% | — | Verified | 2026-05-27 |
| 40 | DeepAgent + Seed-2.0-Code | 34.6% | — | Verified | 2026-05-27 |
| 41 | OpenClaw + Grok-4.3 | 34.4% | — | Verified | 2026-05-27 |
| 42 | Codex + Gemini-3.1-Pro | 31.9% | — | Verified | 2026-05-27 |
| 43 | OpenClaw + Gemini-3.1-Pro | 31.6% | — | Verified | 2026-05-27 |
| 44 | Hermes + Gemini-3.1-Pro | 27.1% | — | Verified | 2026-05-27 |
| 45 | DeepAgent + Grok-4.3 | 13.8% | — | Verified | 2026-05-27 |
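Each Subject in the table reads as a pairing of the form "A + B". Assuming the first part names an agent harness and the second a model (an interpretation, not something the page states), a minimal sketch for averaging Rubric Pass Rate per harness or per model could look like the following. The names `ROWS`, `split_subject`, and `mean_by` are hypothetical, and only the first six rows of the table are copied in for illustration.

```python
from collections import defaultdict

# First six rows of the table, as (subject, rubric pass rate in percent).
ROWS = [
    ("OpenClaw + Opus-4.7", 66.7),
    ("ClaudeCode + Opus-4.7", 64.7),
    ("Hermes + Opus-4.7", 64.5),
    ("DeepAgent + GLM-5.1", 61.0),
    ("Hermes + GLM-5.1", 57.7),
    ("OpenClaw + GLM-5.1", 57.5),
]


def split_subject(subject):
    """Split 'A + B' into (harness, model); this reading is an assumption."""
    harness, model = (part.strip() for part in subject.split("+", 1))
    return harness, model


def mean_by(rows, key_index):
    """Average the pass rate grouped by harness (0) or model (1)."""
    groups = defaultdict(list)
    for subject, rate in rows:
        groups[split_subject(subject)[key_index]].append(rate)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


by_model = mean_by(ROWS, 1)
by_harness = mean_by(ROWS, 0)

# Print models from highest to lowest mean pass rate.
for model, mean_rate in sorted(by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {mean_rate:.1f}%")
```

Extending `ROWS` to all 45 entries would give per-model and per-harness averages for the full leaderboard. Such averages are only a rough summary, since not every pairing of the two parts appears in the table.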