RoboBench

Embodied-brain benchmark for multimodal LLMs across perception, instruction comprehension, planning, affordance prediction, and failure analysis.

16 rows
Primary metric: overall_dimension_average
Sampled: 2026-05-27

Metadata

Metrics

Overall Dimension Average, Perception Reasoning Avg, Instruction Comprehension Avg, Generalized Planning Avg, Affordance Prediction Avg, Failure Analysis Avg

Latest Results

Rows parsed from the public RoboBench tables. Overall dimension average is a BenchmarkList-derived average of the five published dimension averages.
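As a rough illustration of that derivation, the sketch below averages five dimension scores into a single overall value. It assumes an unweighted arithmetic mean rounded to two decimals, which the description above does not confirm. The snake_case dimension keys and the input numbers are hypothetical placeholders, not published RoboBench results.

```python
from statistics import fmean

# Hypothetical keys for the five published dimension averages.
# Only "overall_dimension_average" appears as an identifier in the listing;
# these names are assumed for illustration.
DIMENSIONS = (
    "perception_reasoning_avg",
    "instruction_comprehension_avg",
    "generalized_planning_avg",
    "affordance_prediction_avg",
    "failure_analysis_avg",
)


def overall_dimension_average(scores: dict[str, float]) -> float:
    """Return the mean of the five dimension averages.

    Assumption: equal weighting across dimensions, rounded to two decimals.
    """
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return round(fmean(scores[name] for name in DIMENSIONS), 2)


# Placeholder inputs chosen only to make the arithmetic easy to follow.
example_scores = {
    "perception_reasoning_avg": 10.0,
    "instruction_comprehension_avg": 20.0,
    "generalized_planning_avg": 30.0,
    "affordance_prediction_avg": 40.0,
    "failure_analysis_avg": 50.0,
}

print(overall_dimension_average(example_scores))
```

If the dimensions were weighted (for example by question count), the derived value would differ, so the equal-weight choice here should be read as one plausible interpretation.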

Rank  Subject               Overall Dimension Average  Model Match                                      Provenance  Sampled
1     Human Evaluation      67.19                      -                                                Imported    2026-05-27
2     Gemini-2.5-Pro        50.10                      Gemini 2.5 Pro (google-gemini-2.5-pro)           Imported    2026-05-27
3     Gemini-2.5-Flash      45.06                      Gemini 2.5 Flash (google-gemini-2.5-flash)       Imported    2026-05-27
4     Gemini-2.0-Flash      45.04                      Gemini 2.0 Flash (google-gemini-2.0-flash)       Imported    2026-05-27
5     Qwen-VL-Max           42.43                      Qwen VL Max (qwen-qwen-vl-max)                   Imported    2026-05-27
6     Claude-3.7-Sonnet     40.53                      Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet)  Imported    2026-05-27
7     Qwen2.5-VL-72B-Ins    40.51                      -                                                Imported    2026-05-27
8     GPT-4o                40.16                      GPT-4o (openai-gpt-4o)                           Imported    2026-05-27
9     Claude-3.5-Sonnet     37.82                      Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)  Imported    2026-05-27
10    RoboBrain-2.0-7B      36.59                      -                                                Imported    2026-05-27
11    GPT-4o-Mini           34.40                      GPT-4o-mini (openai-gpt-4o-mini)                 Imported    2026-05-27
12    Qwen-VL-Plus          31.64                      Qwen VL Plus (qwen-qwen-vl-plus)                 Imported    2026-05-27
13    GPT-4o-text-only      30.23                      GPT-4o (openai-gpt-4o)                           Imported    2026-05-27
14    Qwen2.5-VL-7B-Ins     25.57                      -                                                Imported    2026-05-27
15    LLaVA-OneVision-7B    24.91                      -                                                Imported    2026-05-27
16    LLaVA-OneVision-0.5B  16.96                      -                                                Imported    2026-05-27