ARC-AGI-3

Interactive ARC-AGI benchmark variant that evaluates agents adapting to novel grid-based environments, with an official public leaderboard.

Rows: 6
Primary metric: Score
Sampled: 2026-05-05

Metadata

Metrics

Score, Cost/task (lower is better), Total cost (lower is better)
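The direction of each metric matters when ranking rows. The following is a minimal sketch, not the leaderboard's actual implementation: the key names (`score`, `cost_per_task`, `total_cost`) and the `rank_by` helper are hypothetical, and Score is assumed to be higher-is-better because only the two cost metrics are marked otherwise.

```python
# Hypothetical metric registry; key names are illustrative, not the site's schema.
# True means lower values are better (as marked for the two cost metrics).
LOWER_IS_BETTER = {
    "score": False,         # assumed higher-is-better (not marked otherwise)
    "cost_per_task": True,
    "total_cost": True,
}


def rank_by(rows, metric):
    """Return rows sorted best-first for the given metric."""
    if metric not in LOWER_IS_BETTER:
        raise KeyError(f"unknown metric: {metric}")
    # Sort ascending when lower is better, descending otherwise.
    return sorted(rows, key=lambda r: r[metric], reverse=not LOWER_IS_BETTER[metric])
```

Keeping the direction in one lookup table means a new metric only needs one extra entry, and an unknown metric name fails loudly instead of silently sorting the wrong way.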

Latest Results

Scores are stored as percentages. Rows preserve ARC Prize display names because the leaderboard includes base models, reasoning configurations, custom competition systems, and agent systems.

Rank  Subject                     Score (%)  Model Match                                             Provenance  Sampled
1     Anthropic Opus 4.6 (Max)    0.51       -                                                       Imported    2026-05-05
2     GPT-5.5 (High)              0.43       GPT-5.5 (openai-gpt-5.5)                                Imported    2026-05-05
3     Gemini 3.1 Pro (Preview)    0.42       Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-05
4     GPT-5.4 (High)              0.21       GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-05
5     Opus 4.7 (High)             0.18       -                                                       Imported    2026-05-05
6     Grok 4.20 (Beta Reasoning)  0.09       Grok 4.20 (x-ai-grok-4.20)                              Imported    2026-05-05
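
Because scores are stored as percentages, a value such as 0.51 is read here as 0.51%, not 51%. The sketch below loads the six rows listed above and converts each score to a fraction. The `Row` type and its field names are hypothetical; only the subject names and score values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str      # ARC Prize display name, kept verbatim
    score_pct: float  # stored as a percentage, e.g. 0.51 means 0.51%

    @property
    def score_fraction(self) -> float:
        # Convert percentage points to a 0..1 fraction.
        return self.score_pct / 100.0


ROWS = [
    Row("Anthropic Opus 4.6 (Max)", 0.51),
    Row("GPT-5.5 (High)", 0.43),
    Row("Gemini 3.1 Pro (Preview)", 0.42),
    Row("GPT-5.4 (High)", 0.21),
    Row("Opus 4.7 (High)", 0.18),
    Row("Grok 4.20 (Beta Reasoning)", 0.09),
]

# Best-first ordering; Score is treated as higher-is-better (an assumption).
ranked = sorted(ROWS, key=lambda r: r.score_pct, reverse=True)

for rank, row in enumerate(ranked, start=1):
    print(f"{rank} {row.subject}: {row.score_pct:.2f}% ({row.score_fraction:.4f})")
```

Storing the percentage as given and deriving the fraction on demand keeps the stored value identical to what the leaderboard displays, which avoids confusion about whether a number has already been scaled.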