Blueprint-Bench 2

Spatial-reasoning benchmark measuring how accurately models convert apartment photos into 2D floor plans.

Rows: 14
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Normalized Score, Raw Connectivity Similarity, Normalized Standard Error (lower is better), Raw Standard Error (lower is better)

Latest Results

Scores are normalized with random baseline 0.539 and perfect score 1.000, matching the official Andon Labs page.
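The page does not state the normalization formula. A minimal sketch, assuming a linear rescaling in which the random baseline maps to 0 and a perfect score maps to 1, could look like the following. The function names and the linear form are illustrative assumptions, not taken from the Andon Labs page; only the two constants come from the text above.

```python
# Constants quoted on this page.
RANDOM_BASELINE = 0.539
PERFECT_SCORE = 1.000


def normalize_score(raw, baseline=RANDOM_BASELINE, perfect=PERFECT_SCORE):
    """Linearly rescale a raw score so baseline -> 0 and perfect -> 1.

    ASSUMPTION: the linear form is a guess at how the normalization
    works; the source only names the baseline and the perfect score.
    """
    return (raw - baseline) / (perfect - baseline)


def normalize_standard_error(raw_se, baseline=RANDOM_BASELINE, perfect=PERFECT_SCORE):
    """Rescale a standard error by the same factor as the score.

    A standard error is a spread, so only the scale factor applies,
    not the baseline shift (same assumption as above).
    """
    return raw_se / (perfect - baseline)


if __name__ == "__main__":
    # Hypothetical raw value, used only to show the call shape.
    print(normalize_score(0.80))
    print(normalize_standard_error(0.005))
```

Under this assumption, a raw score below the random baseline yields a negative normalized score.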

| Rank | Subject | Normalized Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Human* | 0.809 | | Imported | 2026-05-28 |
| 2 | GPT 5.5 | 0.706 +/- 0.008 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28 |
| 3 | Gemini 3.5 Flash | 0.694 +/- 0.006 | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28 |
| 4 | GPT 5.4 | 0.664 +/- 0.018 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28 |
| 5 | Gemini 3.1 Pro | 0.661 +/- 0.011 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28 |
| 6 | Claude Opus 4.7 | 0.652 +/- 0.009 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28 |
| 7 | Claude Opus 4.8 | 0.606 +/- 0.010 | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28 |
| 8 | Claude Sonnet 4.6 | 0.570 +/- 0.011 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28 |
| 9 | Kimi K2.6 | 0.557 +/- 0.015 | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28 |
| 10 | Claude Haiku 4.5 | 0.367 +/- 0.017 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28 |
| 11 | Gemini 3 Flash | 0.534 +/- 0.019 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28 |
| 12 | Gemini Robotics-ER 1.6 | 0.475 +/- 0.021 | | Imported | 2026-05-28 |
| 13 | Grok 4.20 Reasoning | 0.170 +/- 0.011 | GROK Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28 |
| 14 | Grok 4.3 | 0.477 +/- 0.024 | GROK Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28 |