AutomationBench

Zapier benchmark for evaluating AI agents on end-to-end business workflow execution across sales, marketing, operations, support, finance, and HR environments.

Rows: 14
Primary metric: task_success_rate
Sampled: 2026-05-28

Metadata

Metrics: Task Success Rate; Cost / Task (lower is better)

Showing 2 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Self-reported rows, sampled 2026-05-28:

| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.8 | 15.5% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
| 2 | GPT-5.5 | 12.9% | GPT-5.5 (openai-gpt-5.5) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 9.9% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 9.6% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Self-reported | 2026-05-28 |

Imported rows, sampled 2026-05-21:

| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 3.5 Flash (Medium) | 14.50 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-21 |
| 2 | GPT-5.5 (XHigh) | 12.90 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-21 |
| 3 | Gemini 3.5 Flash (High) | 12.60 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-21 |
| 4 | Gemini 3.5 Flash (Low) | 12.20 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-21 |
| 5 | GPT-5.5 (High) | 11.30 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-21 |
| 6 | Claude Opus 4.7 (Max) | 9.90 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-21 |
| 7 | Gemini 3.1 Pro (High) | 9.60 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-21 |
| 8 | GPT-5.5 (Medium) | 8.50 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-21 |
| 9 | Claude Opus 4.7 (High) | 8.40 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-21 |
| 10 | Claude Opus 4.7 (XHigh) | 8.20 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-21 |
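
Several models appear in both the self-reported and the imported rows, sometimes at more than one effort setting. The sketch below shows one possible way to collapse the 14 rows into a single best-score-per-model view while keeping each row's provenance visible. It assumes the imported values (for example 14.50) are on the same percent scale as the self-reported ones (for example 15.5%), which the page does not state explicitly. The names `Row`, `ROWS`, and `best_per_model` are hypothetical and not part of any AutomationBench tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    subject: str      # label as shown on the leaderboard
    score: float      # task success rate; ASSUMED to be percent in both slices
    model_id: str     # model-match slug from the page
    provenance: str   # "Self-reported" or "Imported"
    sampled: str      # sampling date (ISO format)


S, I = ("Self-reported", "2026-05-28"), ("Imported", "2026-05-21")

# All 14 rows, copied from the two source slices in page order.
ROWS = [
    Row("Claude Opus 4.8", 15.5, "anthropic-claude-opus-4.8", *S),
    Row("GPT-5.5", 12.9, "openai-gpt-5.5", *S),
    Row("Claude Opus 4.7", 9.9, "anthropic-claude-opus-4.7", *S),
    Row("Gemini 3.1 Pro Preview", 9.6, "google-gemini-3.1-pro-preview", *S),
    Row("Gemini 3.5 Flash (Medium)", 14.50, "google-gemini-3.5-flash", *I),
    Row("GPT-5.5 (XHigh)", 12.90, "openai-gpt-5.5", *I),
    Row("Gemini 3.5 Flash (High)", 12.60, "google-gemini-3.5-flash", *I),
    Row("Gemini 3.5 Flash (Low)", 12.20, "google-gemini-3.5-flash", *I),
    Row("GPT-5.5 (High)", 11.30, "openai-gpt-5.5", *I),
    Row("Claude Opus 4.7 (Max)", 9.90, "anthropic-claude-opus-4.7", *I),
    Row("Gemini 3.1 Pro (High)", 9.60, "google-gemini-3.1-pro-preview", *I),
    Row("GPT-5.5 (Medium)", 8.50, "openai-gpt-5.5", *I),
    Row("Claude Opus 4.7 (High)", 8.40, "anthropic-claude-opus-4.7", *I),
    Row("Claude Opus 4.7 (XHigh)", 8.20, "anthropic-claude-opus-4.7", *I),
]


def best_per_model(rows):
    """Keep the highest-scoring row per model_id, sorted best first.

    On a tie the row seen first is kept, so page order decides.
    """
    best = {}
    for row in rows:
        current = best.get(row.model_id)
        if current is None or row.score > current.score:
            best[row.model_id] = row
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for rank, row in enumerate(best_per_model(ROWS), start=1):
        print(f"{rank}. {row.model_id}: {row.score:.1f} "
              f"({row.provenance}, {row.sampled})")
```

Under these assumptions the merged view contains five models, led by Claude Opus 4.8 at 15.5. Because the provenance travels with each row, a self-reported score can still be read as a source claim rather than an independently reproduced result.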