SheetCopilot Benchmark

A spreadsheet-control benchmark for agents that operate spreadsheet software through actions, rather than only answering questions about tables.

Rows: 5
Primary metric: exec_at_1
Sampled: 2026-05-27

Metadata

Metrics

Exec@1, Pass@1, A50 (lower is better), A90 (lower is better)
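The page lists these metrics without defining them. The sketch below shows one plausible way to compute them from per-task results, assuming Exec@1 is the share of solutions that run without an exception, Pass@1 is the share that are functionally correct, and A50/A90 are the 50th and 90th percentiles of the ratio between the actions used and the actions in a reference solution. The `TaskResult` fields, the helper names, the linear-interpolation percentile, and the choice to take ratios over passing tasks only are all assumptions made for illustration, not details taken from this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    executed: bool    # solution ran without raising an exception
    passed: bool      # final sheet judged functionally correct
    actions: int      # atomic actions in the generated solution
    ref_actions: int  # atomic actions in the reference solution


def rate(flags: List[bool]) -> float:
    """Fraction of True values in a non-empty list."""
    return sum(flags) / len(flags)


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation; q is in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(results: List[TaskResult]) -> Dict[str, Optional[float]]:
    # Assumption: action ratios are taken over passing tasks only.
    ratios = [r.actions / r.ref_actions for r in results if r.passed]
    return {
        "exec_at_1": rate([r.executed for r in results]),
        "pass_at_1": rate([r.passed for r in results]),
        "a50": percentile(ratios, 50) if ratios else None,
        "a90": percentile(ratios, 90) if ratios else None,
    }
```

Under these assumptions a ratio of 1.0 means the solution used exactly as many actions as the reference, which is why lower A50 and A90 values are better.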

Latest Results

Rows are transcribed from Table 1 of the public SheetCopilot paper (NeurIPS 2023). The primary score is Exec@1; Pass@1 and the action-efficiency metrics A50 and A90 are preserved where reported.

| Rank | Subject | Exec@1 | Model Match | Provenance | Sampled |
|------|---------|--------|-------------|------------|---------|
| 1 | GPT-3.5-Turbo (100% data) | 87.3% | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 2 | GPT-3.5-Turbo (10% data) | 85.0% | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 3 | Claude (10% data) | 80.0% | | Imported | 2026-05-27 |
| 4 | VBA (100% data) | 77.8% | | Imported | 2026-05-27 |
| 5 | GPT-4 (10% data) | 65.0% | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
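
The rank order above follows the primary metric. As a minimal sketch, the ordering can be reproduced by sorting the five rows by Exec@1 in descending order. The tuple layout and the `rank_rows` helper are hypothetical and used only for illustration; the subjects and scores are copied from the results above.

```python
from typing import List, Tuple

# (subject, Exec@1 in percent), copied from the results listed above.
ROWS: List[Tuple[str, float]] = [
    ("GPT-4 (10% data)", 65.0),
    ("VBA (100% data)", 77.8),
    ("Claude (10% data)", 80.0),
    ("GPT-3.5-Turbo (10% data)", 85.0),
    ("GPT-3.5-Turbo (100% data)", 87.3),
]


def rank_rows(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Return (rank, subject, score) tuples, highest Exec@1 first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, subject, score) for i, (subject, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for rank, subject, score in rank_rows(ROWS):
        print(f"{rank} {subject} {score:.1f}%")
```

Sorting reproduces the listed order, but the subject labels mix 10% and 100% data settings, so adjacent ranks are not necessarily like-for-like comparisons.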