ARC-AGI v2

ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence using input-output grid pairs, each grid between 1x1 and 30x30 cells, where every cell holds a color value from 0 to 9. Models must identify the underlying transformation rule from the demonstration examples and apply it to the test cases. The benchmark is designed to be easy for humans but challenging for AI, and it focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
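
The task format described above can be sketched in Python. This is a minimal illustration, not the benchmark's official tooling: the toy task, the helper names (is_valid_grid, flip_horizontal, rule_fits) and the hand-picked candidate rule are all assumptions made for the example. The "train"/"test" layout with "input"/"output" keys mirrors the public ARC task JSON files.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]


def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1x1 to 30x30, cell values 0-9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(
        len(row) == width
        and all(isinstance(cell, int) and 0 <= cell <= 9 for cell in row)
        for row in grid
    )


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def rule_fits(rule: Callable[[Grid], Grid], pairs: List[Dict[str, Grid]]) -> bool:
    """A rule fits only if it reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in pairs)


# Toy task invented for illustration; real tasks are far harder.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [7, 0]]}],
}

# Accept the candidate rule only if it explains all demonstrations,
# then apply it to the held-out test input.
if rule_fits(flip_horizontal, toy_task["train"]):
    prediction = flip_horizontal(toy_task["test"][0]["input"])
    print(prediction)  # [[5, 0], [0, 7]]
```

A real solver has to search for the rule rather than receive it. The sketch only shows the shape of the data and the exact-match check against the demonstrations.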

Rows: 15

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: arc_agi_v2
Category: Reasoning
Release: 2025-03-24
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	GPT-5.5	0.85	GPT-5.5 (openai-gpt-5.5)	Self-reported	2026-05-06
2	Gemini 3.1 Pro	0.77	Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)	Self-reported	2026-05-06
3	GPT-5.4	0.73	GPT-5.4 (openai-gpt-5.4)	Self-reported	2026-05-06
4	Claude Opus 4.6	0.69	Claude Opus 4.6 (anthropic-claude-opus-4.6)	Self-reported	2026-05-06
5	Claude Sonnet 4.6	0.58	Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)	Self-reported	2026-05-06
6	GPT-5.2 Pro	0.54	GPT-5.2 Pro (openai-gpt-5.2-pro)	Self-reported	2026-05-06
7	GPT-5.2	0.53	GPT-5.2 (openai-gpt-5.2)	Self-reported	2026-05-06
8	Muse Spark	0.42	—	Self-reported	2026-05-06
9	Claude Opus 4.5	0.38	Claude Opus 4.5 (anthropic-claude-opus-4.5)	Self-reported	2026-05-06
10	Gemini 3 Flash	0.34	Gemini 3 Flash Preview (google-gemini-3-flash-preview)	Self-reported	2026-05-06
11	Gemini 3 Pro	0.31	Gemini 3 (google-gemini-3)	Self-reported	2026-05-06
12	Grok-4	0.16	Grok 4 (x-ai-grok-4)	Self-reported	2026-05-06
13	Claude Opus 4	0.09	Claude Opus 4 (anthropic-claude-opus-4)	Imported	2026-05-06
14	o3	0.07	o3 (openai-o3)	Imported	2026-05-06
15	Gemini 2.5 Pro	0.05	Gemini 2.5 Pro (google-gemini-2.5-pro)	Imported	2026-05-06
