Workspace-Bench

Workspace-agent benchmark over file-heavy tasks involving documents, spreadsheets, presentations, code, and multi-file dependencies.

45 rows
rubric_pass_rate (primary metric)
Sampled 2026-05-27

Metadata

Metrics

Rubric Pass Rate, Easy Rubrics Accuracy, Medium Rubrics Accuracy, Hard Rubrics Accuracy, Pass@50, Pass@60, Pass@80
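The page lists these metrics without defining them. As a rough illustration only, the sketch below assumes that Rubric Pass Rate is the pooled share of rubric checks passed across all tasks, and that Pass@50, Pass@60, and Pass@80 are the share of tasks whose own rubric pass rate reaches 50%, 60%, or 80%. Both definitions, the function names, and the example data are assumptions made for this sketch, not the benchmark's documented method.

```python
from typing import Dict, List


def rubric_pass_rate(task_results: Dict[str, List[bool]]) -> float:
    """Pooled share of rubric checks passed across all tasks (assumed definition)."""
    total = sum(len(checks) for checks in task_results.values())
    passed = sum(sum(checks) for checks in task_results.values())
    return passed / total if total else 0.0


def pass_at(task_results: Dict[str, List[bool]], threshold_pct: int) -> float:
    """Share of tasks whose own pass rate is >= threshold_pct (assumed definition)."""
    if not task_results:
        return 0.0
    # Integer comparison avoids floating-point edge cases at the threshold.
    hits = sum(
        1
        for checks in task_results.values()
        if checks and sum(checks) * 100 >= threshold_pct * len(checks)
    )
    return hits / len(task_results)


# Hypothetical per-task rubric outcomes (True = rubric check passed).
example = {
    "task_a": [True, True, False, False],        # 2 of 4 passed
    "task_b": [True, True, True, True, False],   # 4 of 5 passed
    "task_c": [True, False, False],              # 1 of 3 passed
}

print(round(rubric_pass_rate(example), 3))
for k in (50, 60, 80):
    print(k, round(pass_at(example, k), 3))
```

Under these assumed definitions, a threshold metric rewards tasks that are mostly complete, while the pooled rate also credits partial progress on tasks that fall short of the threshold.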

Latest Results

Rows are imported from the official static Workspace-Bench-Lite leaderboard JSON generated from detailed_rubrics_pass_table_all_runs.csv.

Rank Subject Rubric Pass Rate Model Match Provenance Sampled
1 OpenClaw + Opus-4.7 66.7% Verified 2026-05-27
2 ClaudeCode + Opus-4.7 64.7% Verified 2026-05-27
3 Hermes + Opus-4.7 64.5% Verified 2026-05-27
4 DeepAgent + GLM-5.1 61% Verified 2026-05-27
5 Hermes + GLM-5.1 57.7% Verified 2026-05-27
6 OpenClaw + GLM-5.1 57.5% Verified 2026-05-27
7 OpenClaw + Qwen-3.6-Plus 55.6% Verified 2026-05-27
8 ClaudeCode + MiniMax-M2.7 54.6% Verified 2026-05-27
9 DeepAgent + Opus-4.7 54.4% Verified 2026-05-27
10 Codex + GLM-5.1 53.1% Verified 2026-05-27
11 ClaudeCode + Seed-2.0-Lite 53% Verified 2026-05-27
12 ClaudeCode + GLM-5.1 52.6% Verified 2026-05-27
13 Hermes + MiniMax-M2.7 52.6% Verified 2026-05-27
14 ClaudeCode + GPT-5.4 51.8% Verified 2026-05-27
15 Hermes + Qwen-3.6-Plus 50.9% Verified 2026-05-27
16 Hermes + Kimi-2.5 49.1% Verified 2026-05-27
17 ClaudeCode + Kimi-2.5 48.3% Verified 2026-05-27
18 Codex + GPT-5.4 47.7% Verified 2026-05-27
19 Codex + Qwen-3.6-Plus 47.3% Verified 2026-05-27
20 OpenClaw + GPT-5.4 47.1% Verified 2026-05-27
21 Codex + Kimi-2.5 46.9% Verified 2026-05-27
22 OpenClaw + Seed-2.0-Lite 46.6% Verified 2026-05-27
23 Hermes + Seed-2.0-Lite 45.9% Verified 2026-05-27
24 DeepAgent + MiniMax-M2.7 45% Verified 2026-05-27
25 OpenClaw + Kimi-2.5 44.5% Verified 2026-05-27
26 Hermes + GPT-5.4 44.3% Verified 2026-05-27
27 OpenClaw + MiniMax-M2.7 44.1% Verified 2026-05-27
28 Codex + MiniMax-M2.7 42.7% Verified 2026-05-27
29 ClaudeCode + Seed-2.0-Code 42.3% Verified 2026-05-27
30 DeepAgent + Kimi-2.5 41.6% Verified 2026-05-27
31 OpenClaw + Seed-2.0-Code 40.1% Verified 2026-05-27
32 DeepAgent + Qwen-3.6-Plus 39.4% Verified 2026-05-27
33 Hermes + Seed-2.0-Code 38.6% Verified 2026-05-27
34 ClaudeCode + Gemini-3.1-Pro 37.5% Verified 2026-05-27
35 DeepAgent + Gemini-3.1-Pro 37.2% Verified 2026-05-27
36 Codex + Grok-4.3 36.9% Verified 2026-05-27
37 DeepAgent + GPT-5.4 36.2% Verified 2026-05-27
38 Hermes + Grok-4.3 36.2% Verified 2026-05-27
39 DeepAgent + Seed-2.0-Lite 36% Verified 2026-05-27
40 DeepAgent + Seed-2.0-Code 34.6% Verified 2026-05-27
41 OpenClaw + Grok-4.3 34.4% Verified 2026-05-27
42 Codex + Gemini-3.1-Pro 31.9% Verified 2026-05-27
43 OpenClaw + Gemini-3.1-Pro 31.6% Verified 2026-05-27
44 Hermes + Gemini-3.1-Pro 27.1% Verified 2026-05-27
45 DeepAgent + Grok-4.3 13.8% Verified 2026-05-27
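Because each Subject combines an agent harness and a model in the form "harness + model", the rows above can be split and averaged per harness or per model. The following is a minimal sketch, assuming each row follows the layout "rank, subject, rate, status, date" seen in the table; the regular expression, field names, and helper functions are illustrative and are not part of the leaderboard's own tooling.

```python
import re
from collections import defaultdict

# Assumed row layout: "<rank> <harness> + <model> <rate>% <status> <YYYY-MM-DD>"
ROW_RE = re.compile(
    r"^(?P<rank>\d+) (?P<harness>\S+) \+ (?P<model>\S+) "
    r"(?P<rate>\d+(?:\.\d+)?)% (?P<status>\S+) (?P<sampled>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line):
    """Parse one leaderboard row into a dict; raise ValueError if it does not match."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    row = match.groupdict()
    row["rank"] = int(row["rank"])
    row["rate"] = float(row["rate"])
    return row


def mean_rate_by(rows, key):
    """Average the rubric pass rate over rows grouped by `key` ('harness' or 'model')."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["rate"])
    return {name: sum(rates) / len(rates) for name, rates in groups.items()}


# Three rows copied from the table above.
SAMPLE = [
    "1 OpenClaw + Opus-4.7 66.7% Verified 2026-05-27",
    "2 ClaudeCode + Opus-4.7 64.7% Verified 2026-05-27",
    "4 DeepAgent + GLM-5.1 61% Verified 2026-05-27",
]

rows = [parse_row(line) for line in SAMPLE]
print(mean_rate_by(rows, "model"))
print(mean_rate_by(rows, "harness"))
```

Running the same helpers over all 45 rows would give a quick view of how much of the spread comes from the harness and how much from the underlying model.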