ARC-AGI v2

ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs, each grid ranging from 1x1 to 30x30 cells, where every cell holds one of ten colors encoded as an integer from 0 to 9. Models must identify the underlying transformation rule from the demonstration examples and apply it to the test cases. The benchmark is designed to be easy for humans but challenging for AI, and it focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
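To make the task format concrete, here is a minimal Python sketch of one toy task. It is an illustration under stated assumptions, not the benchmark's official loader or scorer: the dictionary keys (`train`, `test`, `input`, `output`), the helper names, and the mirror rule are all hypothetical and chosen only for this example. It checks the grid constraints described above, tests a candidate rule against the demonstration pairs, and applies it to the test input.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # rows of color codes, each an integer 0-9


def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1x1 to 30x30, cells in 0-9."""
    if not 1 <= len(grid) <= 30:
        return False
    width = len(grid[0])
    if not 1 <= width <= 30:
        return False
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def mirror_left_right(grid: Grid) -> Grid:
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [row[::-1] for row in grid]


def rule_fits_demos(rule: Callable[[Grid], Grid], demos: List[Dict[str, Grid]]) -> bool:
    """A rule is plausible only if it reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in demos)


# Toy task (invented for illustration; not taken from the benchmark).
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[3, 4]], "output": [[4, 3]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}

prediction = None
if rule_fits_demos(mirror_left_right, task["train"]):
    prediction = mirror_left_right(task["test"][0]["input"])

print(prediction)  # [[0, 5], [6, 0]]
```

Real tasks are harder because the rule is not given in advance: a solver has to infer it from a handful of demonstrations, and the rules often compose several simpler transformations.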

Rows: 15
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject            Score  Model Match                                             Provenance     Sampled
1     GPT-5.5            0.85   GPT-5.5 (openai-gpt-5.5)                                Self-reported  2026-05-06
2     Gemini 3.1 Pro     0.77   Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Self-reported  2026-05-06
3     GPT-5.4            0.73   GPT-5.4 (openai-gpt-5.4)                                Self-reported  2026-05-06
4     Claude Opus 4.6    0.69   Claude Opus 4.6 (anthropic-claude-opus-4.6)             Self-reported  2026-05-06
5     Claude Sonnet 4.6  0.58   Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)         Self-reported  2026-05-06
6     GPT-5.2 Pro        0.54   GPT-5.2 Pro (openai-gpt-5.2-pro)                        Self-reported  2026-05-06
7     GPT-5.2            0.53   GPT-5.2 (openai-gpt-5.2)                                Self-reported  2026-05-06
8     Muse Spark         0.42   -                                                       Self-reported  2026-05-06
9     Claude Opus 4.5    0.38   Claude Opus 4.5 (anthropic-claude-opus-4.5)             Self-reported  2026-05-06
10    Gemini 3 Flash     0.34   Gemini 3 Flash Preview (google-gemini-3-flash-preview)  Self-reported  2026-05-06
11    Gemini 3 Pro       0.31   Gemini 3 (google-gemini-3)                              Self-reported  2026-05-06
12    Grok-4             0.16   Grok 4 (x-ai-grok-4)                                    Self-reported  2026-05-06
13    Claude Opus 4      0.09   Claude Opus 4 (anthropic-claude-opus-4)                 Imported       2026-05-06
14    o3                 0.07   o3 (openai-o3)                                          Imported       2026-05-06
15    Gemini 2.5 Pro     0.05   Gemini 2.5 Pro (google-gemini-2.5-pro)                  Imported       2026-05-06