AutomationBench
Zapier benchmark for evaluating AI agents on end-to-end business workflow execution across sales, marketing, operations, support, finance, and HR environments.
14 rows
Primary metric: task_success_rate
Sampled: 2026-05-28
Metrics
Task Success Rate, Cost / Task (lower is better)
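As a rough illustration of how these two metrics could be computed from per-task results, here is a minimal sketch. It assumes that Task Success Rate is the percentage of tasks completed successfully and that Cost / Task is the mean cost across all attempted tasks; the benchmark's actual scoring rules are not stated on this page. The `TaskResult` type, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """Hypothetical record for one benchmark task (not an official schema)."""
    task_id: str
    succeeded: bool
    cost_usd: float


def task_success_rate(results: Sequence[TaskResult]) -> float:
    """Percentage of tasks that succeeded (assumed definition)."""
    if not results:
        raise ValueError("no task results provided")
    return 100.0 * sum(r.succeeded for r in results) / len(results)


def cost_per_task(results: Sequence[TaskResult]) -> float:
    """Mean cost over all attempted tasks (assumed definition; lower is better)."""
    if not results:
        raise ValueError("no task results provided")
    return sum(r.cost_usd for r in results) / len(results)


# Made-up sample data, for illustration only.
sample = [
    TaskResult("crm-update", True, 0.5),
    TaskResult("invoice-sync", False, 0.25),
    TaskResult("ticket-triage", False, 0.25),
    TaskResult("onboarding-flow", False, 1.0),
]
print(f"{task_success_rate(sample):.1f}% success, ${cost_per_task(sample):.2f} per task")
```

Under these assumptions, a model that succeeds on one of four tasks scores 25.0%, which is the same scale as the percentages in the table.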
Showing 2 latest source slices.
Self-reported slice (sampled 2026-05-28):

| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 15.5% | Claude Opus 4.8 (`anthropic-claude-opus-4.8`) | Self-reported | 2026-05-28 |
| 2 | GPT-5.5 | 12.9% | GPT-5.5 (`openai-gpt-5.5`) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 9.9% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Self-reported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 9.6% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Self-reported | 2026-05-28 |

Imported slice (sampled 2026-05-21):

| Rank | Subject | Task Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash (Medium) | 14.5% | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-21 |
| 2 | GPT-5.5 (XHigh) | 12.9% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-21 |
| 3 | Gemini 3.5 Flash (High) | 12.6% | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-21 |
| 4 | Gemini 3.5 Flash (Low) | 12.2% | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-21 |
| 5 | GPT-5.5 (High) | 11.3% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-21 |
| 6 | Claude Opus 4.7 (Max) | 9.9% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-21 |
| 7 | Gemini 3.1 Pro (High) | 9.6% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-21 |
| 8 | GPT-5.5 (Medium) | 8.5% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-21 |
| 9 | Claude Opus 4.7 (High) | 8.4% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-21 |
| 10 | Claude Opus 4.7 (XHigh) | 8.2% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-21 |