MMSI-Bench
Multi-image spatial intelligence VQA benchmark with 1,000 human-designed questions across real-world 3D scene understanding, robotics, driving, and motion reasoning.
Rows: 41
Primary metric: average_accuracy
Sampled: 2026-05-28
Average Accuracy
| Rank | Subject | Average Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Human Level | 97.2% | — | Imported | 2026-05-28 |
| 2 | Gemini-3-pro | 49.2% | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-28 |
| 3 | SenseNova-SI-1.2-InternVL3-8B | 42.6% | — | Imported | 2026-05-28 |
| 4 | GPT-5 | 41.9% | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-28 |
| 5 | o3 | 41.0% | o3 (`openai-o3`) | Imported | 2026-05-28 |
| 6 | GPT-4.5 | 40.3% | GPT-4.5 (`openai-gpt-4.5-preview`) | Imported | 2026-05-28 |
| 7 | SenseNova-SI-1.1-Qwen3-VL-8B | 38.1% | — | Imported | 2026-05-28 |
| 8 | Gemini-2.5-Pro--Thinking | 37.0% | — | Imported | 2026-05-28 |
| 9 | Gemini-2.5-Pro | 36.9% | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-28 |
| 10 | SenseNova-SI-1.1-BAGEL-7B-MoT | 34.5% | — | Imported | 2026-05-28 |
| 11 | Doubao-1.5-pro | 33.0% | — | Imported | 2026-05-28 |
| 12 | SenseNova-SI-1.1-Qwen2.5-VL-7B | 32.8% | — | Imported | 2026-05-28 |
| 13 | GPT-4.1 | 30.9% | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-28 |
| 14 | SenseNova-SI-1.1-Qwen2.5-VL-3B | 30.8% | — | Imported | 2026-05-28 |
| 15 | Qwen2.5-VL-72B | 30.7% | Qwen2.5 VL 72B Instruct (`qwen-qwen2.5-vl-72b-instruct`) | Imported | 2026-05-28 |
| 16 | NVILA-15B | 30.5% | — | Imported | 2026-05-28 |
| 17 | GPT-4o | 30.3% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-28 |
| 18 | Claude-3.7-Sonnet--Thinking | 30.2% | Claude 3.7 Sonnet (thinking) (`anthropic-claude-3.7-sonnet-thinking`) | Imported | 2026-05-28 |
| 19 | Seed1.5-VL | 29.7% | — | Imported | 2026-05-28 |
| 20 | InternVL2.5-2B | 29.0% | — | Imported | 2026-05-28 |
| 21 | InternVL2.5-8B | 28.7% | — | Imported | 2026-05-28 |
| 22 | DeepSeek-VL2-Small | 28.6% | — | Imported | 2026-05-28 |
| 23 | InternVL3-78B | 28.5% | — | Imported | 2026-05-28 |
| 24 | InternVL2.5-78B | 28.5% | — | Imported | 2026-05-28 |
| 25 | LLaVA-OneVision-72B | 28.4% | — | Imported | 2026-05-28 |
| 26 | NVILA-8B | 28.1% | — | Imported | 2026-05-28 |
| 27 | InternVL2.5-26B | 28.0% | — | Imported | 2026-05-28 |
| 28 | DeepSeek-VL2 | 27.1% | — | Imported | 2026-05-28 |
| 29 | InternVL3-1B | 27.0% | — | Imported | 2026-05-28 |
| 30 | InternVL3-9B | 26.7% | — | Imported | 2026-05-28 |
| 31 | Qwen2.5-VL-3B | 26.5% | — | Imported | 2026-05-28 |
| 32 | InternVL2.5-4B | 26.3% | — | Imported | 2026-05-28 |
| 33 | InternVL2.5-1B | 26.1% | — | Imported | 2026-05-28 |
| 34 | Qwen2.5-VL-7B | 25.9% | — | Imported | 2026-05-28 |
| 35 | InternVL3-8B | 25.7% | — | Imported | 2026-05-28 |
| 36 | Llama-3.2-11B-Vision | 25.4% | — | Imported | 2026-05-28 |
| 37 | InternVL3-2B | 25.3% | — | Imported | 2026-05-28 |
| 38 | Random Guessing | 25.0% | — | Imported | 2026-05-28 |
| 39 | LLaVA-OneVision-7B | 24.5% | — | Imported | 2026-05-28 |
| 40 | DeepSeek-VL2-Tiny | 24.0% | — | Imported | 2026-05-28 |
| 41 | Blind GPT-4o | 22.7% | — | Imported | 2026-05-28 |
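
The page names `average_accuracy` as the primary metric but does not spell out how it is computed. As a rough sketch, assuming it is plain exact-match accuracy over multiple-choice answer letters, it could be scored as below. The function names and the toy data are hypothetical, and the four-option assumption is only inferred from the 25% Random Guessing row.

```python
from typing import Sequence


def average_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers.

    Assumption: answers are option letters compared case-insensitively.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 100.0 / num_options


# Hypothetical toy data, not taken from the benchmark.
answers = ["A", "C", "B", "D"]
predictions = ["A", "c", "D", "D"]

print(f"{average_accuracy(predictions, answers):.1f}%")  # 75.0%
print(f"{chance_accuracy(4):.1f}%")                      # 25.0%
```

Under this reading, each of the 1,000 questions moves the score by 0.1 percentage points, which is why every entry in the table can be stated to one decimal place.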