MMSI-Bench

Multi-image spatial intelligence VQA benchmark with 1,000 human-designed questions across real-world 3D scene understanding, robotics, driving, and motion reasoning.

41 rows
Primary metric: average_accuracy
Sampled: 2026-05-28

Metrics

Average Accuracy

Latest Results

Rows are imported from the leaderboard in the official MMSI-Bench GitHub README. Baseline rows, such as Human Level and Random Guessing, are kept and assigned non-model subject types.

| Rank | Subject | Average Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Human Level | 97.2% | | Imported | 2026-05-28 |
| 2 | Gemini-3-pro | 49.2% | Gemini 3 / google-gemini-3 | Imported | 2026-05-28 |
| 3 | SenseNova-SI-1.2-InternVL3-8B | 42.6% | | Imported | 2026-05-28 |
| 4 | GPT-5 | 41.9% | GPT-5 / openai-gpt-5 | Imported | 2026-05-28 |
| 5 | o3 | 41% | o3 / openai-o3 | Imported | 2026-05-28 |
| 6 | GPT-4.5 | 40.3% | GPT-4.5 / openai-gpt-4.5-preview | Imported | 2026-05-28 |
| 7 | SenseNova-SI-1.1-Qwen3-VL-8B | 38.1% | | Imported | 2026-05-28 |
| 8 | Gemini-2.5-Pro--Thinking | 37% | | Imported | 2026-05-28 |
| 9 | Gemini-2.5-Pro | 36.9% | Gemini 2.5 Pro / google-gemini-2.5-pro | Imported | 2026-05-28 |
| 10 | SenseNova-SI-1.1-BAGEL-7B-MoT | 34.5% | | Imported | 2026-05-28 |
| 11 | Doubao-1.5-pro | 33% | | Imported | 2026-05-28 |
| 12 | SenseNova-SI-1.1-Qwen2.5-VL-7B | 32.8% | | Imported | 2026-05-28 |
| 13 | GPT-4.1 | 30.9% | GPT-4.1 / openai-gpt-4.1 | Imported | 2026-05-28 |
| 14 | SenseNova-SI-1.1-Qwen2.5-VL-3B | 30.8% | | Imported | 2026-05-28 |
| 15 | Qwen2.5-VL-72B | 30.7% | Qwen2.5 VL 72B Instruct / qwen-qwen2.5-vl-72b-instruct | Imported | 2026-05-28 |
| 16 | NVILA-15B | 30.5% | | Imported | 2026-05-28 |
| 17 | GPT-4o | 30.3% | GPT-4o / openai-gpt-4o | Imported | 2026-05-28 |
| 18 | Claude-3.7-Sonnet--Thinking | 30.2% | Claude 3.7 Sonnet (thinking) / anthropic-claude-3.7-sonnet-thinking | Imported | 2026-05-28 |
| 19 | Seed1.5-VL | 29.7% | | Imported | 2026-05-28 |
| 20 | InternVL2.5-2B | 29% | | Imported | 2026-05-28 |
| 21 | InternVL2.5-8B | 28.7% | | Imported | 2026-05-28 |
| 22 | DeepSeek-VL2-Small | 28.6% | | Imported | 2026-05-28 |
| 23 | InternVL3-78B | 28.5% | | Imported | 2026-05-28 |
| 24 | InternVL2.5-78B | 28.5% | | Imported | 2026-05-28 |
| 25 | LLaVA-OneVision-72B | 28.4% | | Imported | 2026-05-28 |
| 26 | NVILA-8B | 28.1% | | Imported | 2026-05-28 |
| 27 | InternVL2.5-26B | 28% | | Imported | 2026-05-28 |
| 28 | DeepSeek-VL2 | 27.1% | | Imported | 2026-05-28 |
| 29 | InternVL3-1B | 27% | | Imported | 2026-05-28 |
| 30 | InternVL3-9B | 26.7% | | Imported | 2026-05-28 |
| 31 | Qwen2.5-VL-3B | 26.5% | | Imported | 2026-05-28 |
| 32 | InternVL2.5-4B | 26.3% | | Imported | 2026-05-28 |
| 33 | InternVL2.5-1B | 26.1% | | Imported | 2026-05-28 |
| 34 | Qwen2.5-VL-7B | 25.9% | | Imported | 2026-05-28 |
| 35 | InternVL3-8B | 25.7% | | Imported | 2026-05-28 |
| 36 | InternVL3-2B | 25.3% | | Imported | 2026-05-28 |
| 37 | Llama-3.2-11B-Vision | 25.4% | | Imported | 2026-05-28 |
| 38 | Random Guessing | 25% | | Imported | 2026-05-28 |
| 39 | LLaVA-OneVision-7B | 24.5% | | Imported | 2026-05-28 |
| 40 | DeepSeek-VL2-Tiny | 24% | | Imported | 2026-05-28 |
| 41 | Blind GPT-4o | 22.7% | | Imported | 2026-05-28 |
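
The scores are easier to read against the baseline rows. The following is a minimal Python sketch that computes how far a few subjects sit above or below the Random Guessing row, in percentage points. The accuracy values are copied from the leaderboard above. The helper name `margin_over` and the choice of rows are illustrative assumptions, not part of the MMSI-Bench tooling.

```python
# A few (subject, average accuracy in %) pairs copied from the leaderboard.
rows = [
    ("Human Level", 97.2),
    ("Gemini-3-pro", 49.2),
    ("GPT-5", 41.9),
    ("Random Guessing", 25.0),
    ("Blind GPT-4o", 22.7),
]


def margin_over(rows, baseline):
    """Return {subject: accuracy minus baseline accuracy}, in percentage points.

    Hypothetical helper for illustration; the baseline row itself is omitted.
    """
    scores = dict(rows)
    base = scores[baseline]
    # Round to one decimal to match the precision of the published scores.
    return {name: round(acc - base, 1) for name, acc in rows if name != baseline}


margins = margin_over(rows, "Random Guessing")
for name, delta in margins.items():
    print(f"{name}: {delta:+.1f} pts vs Random Guessing")
```

On these rows, the top-ranked model is 24.2 points above Random Guessing, while Human Level is 72.2 points above it.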