RoboBench
Embodied-brain benchmark for multimodal LLMs across perception, instruction comprehension, planning, affordance prediction, and failure analysis.
Rows: 16
Primary metric: overall_dimension_average
Sampled: 2026-05-27
Metadata
Metrics: Overall Dimension Average, Perception Reasoning Avg, Instruction Comprehension Avg, Generalized Planning Avg, Affordance Prediction Avg, Failure Analysis Avg
| Rank | Subject | Overall Dimension Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Human Evaluation | 67.19 | — | Imported | 2026-05-27 |
| 2 | Gemini-2.5-Pro | 50.10 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-27 |
| 3 | Gemini-2.5-Flash | 45.06 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-27 |
| 4 | Gemini-2.0-Flash | 45.04 | Gemini 2.0 Flash (`google-gemini-2.0-flash`) | Imported | 2026-05-27 |
| 5 | Qwen-VL-Max | 42.43 | Qwen VL Max (`qwen-qwen-vl-max`) | Imported | 2026-05-27 |
| 6 | Claude-3.7-Sonnet | 40.53 | Claude 3.7 Sonnet (`anthropic-claude-3.7-sonnet`) | Imported | 2026-05-27 |
| 7 | Qwen2.5-VL-72B-Ins | 40.51 | — | Imported | 2026-05-27 |
| 8 | GPT-4o | 40.16 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 9 | Claude-3.5-Sonnet | 37.82 | Claude 3.5 Sonnet (`anthropic-claude-3.5-sonnet`) | Imported | 2026-05-27 |
| 10 | RoboBrain-2.0-7B | 36.59 | — | Imported | 2026-05-27 |
| 11 | GPT-4o-Mini | 34.40 | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-27 |
| 12 | Qwen-VL-Plus | 31.64 | Qwen VL Plus (`qwen-qwen-vl-plus`) | Imported | 2026-05-27 |
| 13 | GPT-4o-text-only | 30.23 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 14 | Qwen2.5-VL-7B-Ins | 25.57 | — | Imported | 2026-05-27 |
| 15 | LLaVA-OneVision-7B | 24.91 | — | Imported | 2026-05-27 |
| 16 | LLaVA-OneVision-0.5B | 16.96 | — | Imported | 2026-05-27 |
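
One way to read the table is as a gap to the human baseline in row 1. The following is a minimal sketch, not part of the benchmark's tooling: the scores are copied from a few rows of the Overall Dimension Average column above, while the `gap_to_human` helper and the dictionary layout are hypothetical names introduced only for illustration.

```python
# Scores copied from the "Overall Dimension Average" column above
# (a subset of rows, chosen only to keep the example short).
scores = {
    "Human Evaluation": 67.19,
    "Gemini-2.5-Pro": 50.10,
    "GPT-4o": 40.16,
    "GPT-4o-text-only": 30.23,
    "LLaVA-OneVision-0.5B": 16.96,
}


def gap_to_human(scores, baseline="Human Evaluation"):
    """Return {subject: baseline score minus subject score}, rounded to 2 decimals.

    Hypothetical helper for illustration; the baseline row itself is excluded.
    """
    human = scores[baseline]
    return {
        name: round(human - value, 2)
        for name, value in scores.items()
        if name != baseline
    }


gaps = gap_to_human(scores)

# Print subjects from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f} points below the human baseline")
```

The same subtraction applied to the GPT-4o and GPT-4o-text-only rows gives a rough sense of how much of the score depends on visual input, under the assumption that the two rows differ only in whether images were provided.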