From Perception to Action

Interactive 3D vision-reasoning benchmark where models plan physical actions in puzzle and stacking environments.

Rows: 16
Primary metric: pass_at_1
Sampled: 2026-05-28

Metadata

Metrics

pass@1, Successful Tasks, Puzzle Success Rate, Stacking Success Rate, Average Steps (lower is better), Distance to Optimal (lower is better), Normalized Distance (lower is better), Solved/Tokens (Reported), Solved/USD
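The page does not define its metrics. As a minimal sketch, assuming pass@1 here has its usual meaning (the fraction of tasks solved on a single first attempt), it could be computed as below. The function name `pass_at_1` and the sample outcomes are hypothetical illustrations, not benchmark data or the benchmark's own code.

```python
from typing import Sequence


def pass_at_1(first_attempt_solved: Sequence[bool]) -> float:
    """Return the fraction of tasks solved on the first (and only) attempt.

    Assumption: one attempt per task, each recorded as True (solved) or False.
    """
    if not first_attempt_solved:
        raise ValueError("need at least one task outcome")
    return sum(first_attempt_solved) / len(first_attempt_solved)


# Hypothetical outcomes for 8 tasks (illustrative only, not benchmark data).
outcomes = [True, False, False, True, False, False, False, False]
score = pass_at_1(outcomes)
print(f"pass@1 = {score:.1%}")  # prints "pass@1 = 25.0%"
```

Under this reading, the other success-rate metrics in the list would be the same ratio restricted to the puzzle or stacking subset, though the page does not confirm that.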

Latest Results

Rows are imported from the LaTeX source of the public arXiv paper. The benchmark evaluates models in an interactive 3D vision-reasoning setting with puzzle and stacking tasks.

| Rank | Subject | pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.2 | 22.9% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28 |
| 2 | Gemini-3-Pro | 19.3% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28 |
| 3 | Claude-Opus-4.5 | 15.6% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28 |
| 4 | Claude-Sonnet-4.5 | 13.8% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28 |
| 5 | Kimi-k2.5 | 13.8% | KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28 |
| 6 | Gemini-3-Flash | 11.9% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28 |
| 7 | GPT-5-mini | 11.0% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28 |
| 8 | OpenAI-o3 | 10.1% | o3 (openai-o3) | Imported | 2026-05-28 |
| 9 | Qwen3-VL-30B-A3B-Thk. | 10.1% | | Imported | 2026-05-28 |
| 10 | Seed-1.6 | 10.1% | Seed 1.6 (bytedance-seed-seed-1.6) | Imported | 2026-05-28 |
| 11 | Qwen3-VL-235B-A22B-Thk. | 9.2% | | Imported | 2026-05-28 |
| 12 | Qwen3-VL-235B-A22B-Inst | 8.3% | | Imported | 2026-05-28 |
| 13 | Qwen3-VL-8B-Thinking | 8.3% | Qwen3 VL 8B Thinking (qwen-qwen3-vl-8b-thinking) | Imported | 2026-05-28 |
| 14 | GLM-4.6V | 7.3% | GLM GLM 4.6V (z-ai-glm-4.6v) | Imported | 2026-05-28 |
| 15 | Seed-1.6-Flash | 7.3% | Seed 1.6 Flash (bytedance-seed-seed-1.6-flash) | Imported | 2026-05-28 |
| 16 | Qwen3-VL-30B-A3B-Inst | 3.7% | | Imported | 2026-05-28 |
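
Several rows share the same pass@1 value (for example 13.8% at ranks 4 and 5, and 10.1% at ranks 8 to 10), yet the listing numbers them sequentially; the page does not state its tie-break rule. As a sketch for readers who prefer tie-aware ranks, the snippet below applies standard competition ranking to the pass@1 column. The helper name `competition_ranks` is hypothetical and is not part of the leaderboard.

```python
from typing import List, Sequence


def competition_ranks(scores: Sequence[float]) -> List[int]:
    """Standard competition ranking: tied scores share the best rank.

    Assumes `scores` is already sorted in descending order.
    """
    ranks: List[int] = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # same score as the previous row: reuse its rank
        else:
            ranks.append(i + 1)      # new score: rank is the 1-based position
    return ranks


# pass@1 values (percent) copied from the results above, in listed order.
pass_at_1_column = [22.9, 19.3, 15.6, 13.8, 13.8, 11.9, 11.0, 10.1,
                    10.1, 10.1, 9.2, 8.3, 8.3, 7.3, 7.3, 3.7]

print(competition_ranks(pass_at_1_column))
# prints [1, 2, 3, 4, 4, 6, 7, 8, 8, 8, 11, 12, 12, 14, 14, 16]
```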