OfficeBench

Office workflow agent benchmark spanning Word, Excel, PDF, email, calendar, and multi-application task completion.

Rows: 8
Primary metric: overall_score
Sampled: 2026-05-27

Metadata

Metrics

Overall Score, Single-App Success, Two-App Success, Three-App Success

Latest Results

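Rows parsed from the public OfficeBench README leaderboard. Scores are task success rates in percent. Only the overall score is listed here; the single-, two-, and three-app metrics break success down by the number of applications a task requires.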

Rank  Subject                     Overall Score  Model Match  Provenance  Sampled
1     Gemini-1.0 Pro (Feb 2024)   12.33          -            Imported    2026-05-27
2     Gemini-1.5 Flash (May 2024) 18.67          -            Imported    2026-05-27
3     Gemini-1.5 Pro (May 2024)   26.00          -            Imported    2026-05-27
4     GPT-3.5 Turbo (0125)        5.35           -            Imported    2026-05-27
5     GPT-4 Turbo (2024-04-09)    38.00          -            Imported    2026-05-27
6     GPT-4 Omni (2024-05-13)     47.00          -            Imported    2026-05-27
7     Llama 3 (70B-Instruct)      27.33          -            Imported    2026-05-27
8     Qwen 2 (72B-Instruct)       21.16          -            Imported    2026-05-27
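
The rows above are not sorted by score (the first row has 12.33 and the sixth has 47.00), so a reader who wants a score-ordered view has to re-sort them. The following is a minimal Python sketch of that step. The `ROWS` list and the `rank_by_overall_score` helper are illustrative names introduced here, not part of OfficeBench or of this page's tooling; the values are copied from the table.

```python
# Rows copied from the table above as (subject, overall_score in percent).
# ROWS and rank_by_overall_score are hypothetical names used for illustration.
ROWS = [
    ("Gemini-1.0 Pro (Feb 2024)", 12.33),
    ("Gemini-1.5 Flash (May 2024)", 18.67),
    ("Gemini-1.5 Pro (May 2024)", 26.00),
    ("GPT-3.5 Turbo (0125)", 5.35),
    ("GPT-4 Turbo (2024-04-09)", 38.00),
    ("GPT-4 Omni (2024-05-13)", 47.00),
    ("Llama 3 (70B-Instruct)", 27.33),
    ("Qwen 2 (72B-Instruct)", 21.16),
]


def rank_by_overall_score(rows):
    """Return the rows sorted by overall score, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Print a score-ordered listing with 1-based positions.
    for position, (subject, score) in enumerate(rank_by_overall_score(ROWS), start=1):
        print(f"{position}. {subject}: {score:.2f}")
```

Sorting on the numeric score rather than on the printed string avoids lexical ordering mistakes such as "5.35" sorting after "47.00".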