Blueprint-Bench 2
Spatial-reasoning benchmark measuring how accurately models convert apartment photos into 2D floor plans.
Rows: 14
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Normalized Score, Raw Connectivity Similarity, Normalized Standard Error (lower is better), Raw Standard Error (lower is better)
| Rank | Subject | Normalized Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Human* | 0.809 | — | Imported | 2026-05-28 |
| 2 | GPT 5.5 | 0.706 +/- 0.008 | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-28 |
| 3 | Gemini 3.5 Flash | 0.694 +/- 0.006 | Gemini 3.5 Flash (`google-gemini-3.5-flash`) | Imported | 2026-05-28 |
| 4 | GPT 5.4 | 0.664 +/- 0.018 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 5 | Gemini 3.1 Pro | 0.661 +/- 0.011 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-28 |
| 6 | Claude Opus 4.7 | 0.652 +/- 0.009 | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-28 |
| 7 | Claude Opus 4.8 | 0.606 +/- 0.010 | Claude Opus 4.8 (`anthropic-claude-opus-4.8`) | Imported | 2026-05-28 |
| 8 | Claude Sonnet 4.6 | 0.570 +/- 0.011 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-28 |
| 9 | Kimi K2.6 | 0.557 +/- 0.015 | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-28 |
| 10 | Claude Haiku 4.5 | 0.367 +/- 0.017 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-28 |
| 11 | Gemini 3 Flash | 0.534 +/- 0.019 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-28 |
| 12 | Gemini Robotics-ER 1.6 | 0.475 +/- 0.021 | — | Imported | 2026-05-28 |
| 13 | Grok 4.20 Reasoning | 0.170 +/- 0.011 | Grok 4.20 (`x-ai-grok-4.20`) | Imported | 2026-05-28 |
| 14 | Grok 4.3 | 0.477 +/- 0.024 | Grok 4.3 (`x-ai-grok-4.3`) | Imported | 2026-05-28 |
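
When reading the Normalized Score column programmatically, it may help to split each cell into a value and its error term. The following is a minimal sketch, assuming the `+/-` term is a symmetric error around the score (plausibly the Normalized Standard Error listed under Metrics, though the page does not say so explicitly). The names `Score`, `parse_score`, `intervals_overlap`, and `sort_by_score` are hypothetical helpers for illustration, not part of any Blueprint-Bench tooling.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Score(NamedTuple):
    value: float
    err: Optional[float]  # None when the cell has no "+/-" term (e.g. the Human row)


# Matches cells such as "0.706 +/- 0.008" or a bare "0.809".
_SCORE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:\+/-\s*(\d+(?:\.\d+)?))?\s*$")


def parse_score(cell: str) -> Score:
    """Parse one Normalized Score cell into (value, err)."""
    m = _SCORE_RE.match(cell)
    if m is None:
        raise ValueError(f"unrecognized score cell: {cell!r}")
    err = float(m.group(2)) if m.group(2) is not None else None
    return Score(float(m.group(1)), err)


def intervals_overlap(a: Score, b: Score) -> bool:
    """True if [value - err, value + err] ranges intersect.

    Assumption: a missing error term is treated as zero width. This is a
    rough closeness check, not a statistical significance test.
    """
    return abs(a.value - b.value) <= (a.err or 0.0) + (b.err or 0.0)


def sort_by_score(rows: List[Tuple[str, str]]) -> List[Tuple[str, Score]]:
    """Sort (subject, score_cell) pairs by parsed score, highest first."""
    parsed = [(subject, parse_score(cell)) for subject, cell in rows]
    return sorted(parsed, key=lambda item: item[1].value, reverse=True)


if __name__ == "__main__":
    gpt54 = parse_score("0.664 +/- 0.018")
    gemini31 = parse_score("0.661 +/- 0.011")
    print(intervals_overlap(gpt54, gemini31))
```

Under the zero-width assumption for rows without an error term, `intervals_overlap` only indicates whether two entries are close relative to their reported errors; `sort_by_score` can be used to compare the Rank column against the order implied by the scores.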