AutomationBench | BenchmarkList

Metadata

Task Success Rate, Cost / Task (lower is better)

Showing the 2 latest source slices.

Rank	Subject	Task Success Rate	Model Match	Provenance	Sampled
1	Claude Opus 4.8	15.5%	Claude Opus 4.8 anthropic-claude-opus-4.8	Self-reported	2026-05-28
2	GPT-5.5	12.9%	GPT-5.5 openai-gpt-5.5	Self-reported	2026-05-28
3	Claude Opus 4.7	9.9%	Claude Opus 4.7 anthropic-claude-opus-4.7	Self-reported	2026-05-28
4	Gemini 3.1 Pro Preview	9.6%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Self-reported	2026-05-28

Rank	Subject	Task Success Rate	Model Match	Provenance	Sampled
1	Gemini 3.5 Flash (Medium)	14.50%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-21
2	GPT-5.5 (XHigh)	12.90%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-21
3	Gemini 3.5 Flash (High)	12.60%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-21
4	Gemini 3.5 Flash (Low)	12.20%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-21
5	GPT-5.5 (High)	11.30%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-21
6	Claude Opus 4.7 (Max)	9.90%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-21
7	Gemini 3.1 Pro (High)	9.60%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-21
8	GPT-5.5 (Medium)	8.50%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-21
9	Claude Opus 4.7 (High)	8.40%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-21
10	Claude Opus 4.7 (XHigh)	8.20%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-21