SpreadsheetBench
Spreadsheet-agent benchmark for real Excel tasks and business spreadsheet workflows, including financial modeling, debugging, and visualization.
Rows: 33
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Score, Template, Financial Modeling, Debug, Visualization
Showing the 4 latest source slices. Ranks restart at 1 within each slice.
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 Max | 89.3% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Self-reported | 2026-05-28 |
| 2 | Qwen3.7 Max | 87% | Qwen3.7 Max (`qwen-qwen3.7-max`) | Self-reported | 2026-05-28 |
| 3 | GLM-5.1 Thinking | 85.2% | GLM 5.1 (`z-ai-glm-5.1`) | Self-reported | 2026-05-28 |
| 4 | DeepSeek V4 Pro Max | 84.9% | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Self-reported | 2026-05-28 |
| 5 | Kimi K2.6 Thinking | 84.5% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Self-reported | 2026-05-28 |
| 6 | Qwen3.6 Plus | 80.2% | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Self-reported | 2026-05-28 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini in Google Sheets | 70.48% | — | Verified | 2026-05-27 |
| 2 | Qingqiu Agent | 69.96% | — | Verified | 2026-05-27 |
| 3 | Univer | 68.86% | — | Verified | 2026-05-27 |
| 4 | 灵犀 | 66.89% | — | Verified | 2026-05-27 |
| 5 | Bluebox | 62.9% | — | Verified | 2026-05-27 |
| 6 | Shortcut.ai | 59.25% | — | Verified | 2026-05-27 |
| 7 | Copilot in Excel (Agent Mode) | 57.2% | — | Imported | 2026-05-27 |
| 8 | ChatGPT Agent w/ .xlsx | 45.5% | — | Imported | 2026-05-27 |
| 9 | Claude Files Opus 4.1 | 42.9% | — | Imported | 2026-05-27 |
| 10 | ChatGPT Agent | 35.3% | — | Imported | 2026-05-27 |
| 11 | OpenAI o3 | 23.3% | — | Imported | 2026-05-27 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qingqiu Agent | 94.75% | — | Verified | 2026-05-27 |
| 2 | Tetra-Beta-2 | 94.25% | — | Verified | 2026-05-27 |
| 3 | GPT for Excel | 92.5% | — | Verified | 2026-05-27 |
| 4 | WPS AI (Seed 2.0) | 91.25% | — | Verified | 2026-05-27 |
| 5 | Nobie Agent | 91% | — | Verified | 2026-05-27 |
| 6 | Shortcut.ai | 86% | — | Verified | 2026-05-27 |
| 7 | Kyra | 84.25% | — | Verified | 2026-05-27 |
| 8 | Decide Agent | 82.5% | — | Verified | 2026-05-27 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 (Bash Agent) | 34.89% | — | Verified | 2026-05-27 |
| 2 | GPT-5.2 (Bash Agent) | 26.79% | — | Verified | 2026-05-27 |
| 3 | Gemini 3.1 Pro (Bash Agent) | 23.68% | — | Verified | 2026-05-27 |
| 4 | GLM-5.0 (Bash Agent) | 17.14% | — | Verified | 2026-05-27 |
| 5 | Deepseek-V3.2 (Bash Agent) | 15.58% | — | Verified | 2026-05-27 |
| 6 | Kimi K2.5 (Bash Agent) | 14.64% | — | Verified | 2026-05-27 |
| 7 | Qwen3.5-397B-A17B (Bash Agent) | 11.22% | — | Verified | 2026-05-27 |
| 8 | MiniMax M2.5 (Bash Agent) | 7.17% | — | Verified | 2026-05-27 |