BabyAI

BabyAI: Measures embodied-agent task success in a simulated grid world, where the agent must follow natural-language instructions that require navigation and object manipulation.

44 rows
babyai_score (primary metric)
Sampled 2026-05-27

Metadata

Metrics

BabyAI score, BALROG progress, Crafter score, TextWorld score, BabaIsAI score, MiniHack score, NetHack score

Latest Results

Rows are parsed from the public BALROG static HTML LLM and VLM leaderboard tables. The BabyAI column is used as the primary metric; broader BALROG environment columns are retained when present.
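The import step described above can be sketched roughly as follows. This is a minimal illustration, not the actual importer: the column names `Model` and `BabyAI`, the `score ± error` cell format, the helper names `parse_leaderboard` and `rank`, and the alphabetical tie-break are all assumptions made for the example. The sample HTML reuses three values from the table in this section.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_leaderboard(html, kind):
    """Parse one table; `kind` is "LLM" or "VLM" (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = []
    for cells in body:
        row = dict(zip(header, cells))
        # Assumed cell format: "77.6 ± 3.7"; the error part is optional.
        score, _, err = row["BabyAI"].partition("±")
        records.append({
            "subject": f"{row['Model']} ({kind})",
            "babyai_score": float(score),
            "std_error": float(err) if err.strip() else None,
            # Any other environment columns are kept as-is when present.
            "other": {k: v for k, v in row.items()
                      if k not in ("Model", "BabyAI")},
        })
    return records


def rank(records):
    """Sort by BabyAI score (descending); ties broken by name (assumption)."""
    ordered = sorted(records,
                     key=lambda r: (-r["babyai_score"], r["subject"].lower()))
    return [{"rank": i, **r} for i, r in enumerate(ordered, start=1)]


# Sample input: assumed markup, values copied from this section's table.
LLM_HTML = """
<table>
<tr><th>Model</th><th>BabyAI</th></tr>
<tr><td>Llama-3.1-70B-it</td><td>73.2 ± 4.0</td></tr>
<tr><td>GPT-4o-2024-05-13</td><td>77.6 ± 3.7</td></tr>
</table>
"""
VLM_HTML = """
<table>
<tr><th>Model</th><th>BabyAI</th></tr>
<tr><td>Claude-3.5-Sonnet-2024-10-22</td><td>82.0 ± 5.4</td></tr>
</table>
"""

ranked = rank(parse_leaderboard(LLM_HTML, "LLM")
              + parse_leaderboard(VLM_HTML, "VLM"))
for r in ranked:
    print(r["rank"], r["subject"], r["babyai_score"], r["std_error"])
```

Tagging each subject with its table of origin keeps the LLM and VLM entries for the same model distinct after the two tables are merged.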

Rank Subject BabyAI score (± std. error) Model Match Provenance Sampled
1 Gemini-3.1-Pro (LLM) 100.0 ± 0.0 Imported 2026-05-27
2 Gemini-3.1-Pro-Thinking (LLM) 98.0 ± 2.0 Imported 2026-05-27
3 Gemini-3-Pro (LLM) 96.0 ± 2.8 Imported 2026-05-27
4 Gemini-3-Flash (LLM) 86.0 ± 4.9 Imported 2026-05-27
5 Claude-3.5-Sonnet-2024-10-22 (VLM) 82.0 ± 5.4 Imported 2026-05-27
6 Claude-Opus-4.5 (LLM) 80.0 ± 5.7 Imported 2026-05-27
7 Gemini-2.5-Pro-Exp-03-25 (LLM) 80.0 ± 5.7 Imported 2026-05-27
8 GPT-5-minimal-think (LLM) 80.0 ± 5.7 Imported 2026-05-27
9 GPT-4o-2024-05-13 (LLM) 77.6 ± 3.7 Imported 2026-05-27
10 Grok-4 (LLM) 76.0 ± 6.0 Imported 2026-05-27
11 Reka-Flash-3 (LLM) 76.0 ± 6.0 Imported 2026-05-27
12 DeepSeek-R1 (LLM) 74.0 ± 6.2 Imported 2026-05-27
13 Gemini-2.5-Pro-Exp-03-25 (VLM) 74.0 ± 6.2 Imported 2026-05-27
14 Llama-3.1-70B-it (LLM) 73.2 ± 4.0 Imported 2026-05-27
15 Claude-Opus-4.5-Thinking (LLM) 72.0 ± 6.3 Imported 2026-05-27
16 Llama-3.2-90B-it (LLM) 72.0 ± 6.3 Imported 2026-05-27
17 Claude-3.5-Sonnet-2024-10-22 (LLM) 68.0 ± 6.6 Imported 2026-05-27
18 Gemini-2.5-Flash (LLM) 68.0 ± 6.6 Imported 2026-05-27
19 Claude-Haiku-4.5 (LLM) 66.0 ± 6.7 Imported 2026-05-27
20 Llama-3.2-90B-it (VLM) 66.0 ± 6.7 Imported 2026-05-27
21 Llama-3.3-70B-it (LLM) 66.0 ± 6.7 Imported 2026-05-27
22 GPT-4o-2024-05-13 (VLM) 62.0 ± 4.3 Imported 2026-05-27
23 Grok-3-beta (LLM) 62.0 ± 6.9 Imported 2026-05-27
24 Gemini-1.5-Pro-002 (LLM) 58.4 ± 4.4 Imported 2026-05-27
25 Gemini-1.5-Pro-002 (VLM) 58.4 ± 4.4 Imported 2026-05-27
26 Claude-3.5-Haiku-2024-10-22 (LLM) 52.0 ± 7.1 Imported 2026-05-27
27 GPT-4o-mini-2024-07-18 (LLM) 50.4 ± 4.5 Imported 2026-05-27
28 Gemini-1.5-Flash-002 (LLM) 50.0 ± 7.1 Imported 2026-05-27
29 Llama-3.2-11B-it (LLM) 50.0 ± 7.1 Imported 2026-05-27
30 Mistral-Nemo-it-2407 (LLM) 50.0 ± 7.1 Imported 2026-05-27
31 DeepSeek-R1-Distill-Qwen-32B (LLM) 48.0 ± 7.1 Imported 2026-05-27
32 Gemini-1.5-Flash-002 (VLM) 43.2 ± 4.4 Imported 2026-05-27
33 GPT-4o-mini-2024-07-18 (VLM) 38.0 ± 4.3 Imported 2026-05-27
34 Llama-3.1-8B-it (LLM) 36.0 ± 6.8 Imported 2026-05-27
35 Qwen2-VL-72B-it (VLM) 34.0 ± 6.7 Imported 2026-05-27
36 Qwen2.5-72B-it (LLM) 34.0 ± 6.7 Imported 2026-05-27
37 Microsoft-Phi-4 (LLM) 32.0 ± 6.6 Imported 2026-05-27
38 Qwen2-VL-72B-it (LLM) 24.0 ± 6.0 Imported 2026-05-27
39 Llama-3.2-3B-it (LLM) 20.0 ± 5.7 Imported 2026-05-27
40 Llama-3.2-11B-it (VLM) 18.0 ± 5.4 Imported 2026-05-27
41 Qwen-2.5-7B-it (LLM) 14.0 ± 4.9 Imported 2026-05-27
42 Llama-3.2-1B-it (LLM) 8.0 ± 3.8 Imported 2026-05-27
43 Qwen2-VL-7B-it (LLM) 4.0 ± 2.8 Imported 2026-05-27
44 Qwen2-VL-7B-it (VLM) 2.0 ± 2.0 Imported 2026-05-27
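
The uncertainty figure listed next to each BabyAI score looks consistent with a binomial standard error on a success rate, which would also explain why two subjects with the same score (for example 62.0) can carry different uncertainties. The sketch below checks that reading against a few rows. It is an interpretation, not something the source states: the formula and the episode counts of 50 and 125 are assumptions, chosen only because they reproduce the listed values, and `binomial_std_error` is a hypothetical helper name.

```python
import math


def binomial_std_error(score_pct, episodes):
    """Standard error (in percentage points) of a success rate.

    Assumes each episode is an independent success/failure trial.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / episodes)


# (subject, score, assumed episode count, uncertainty listed in the table)
checks = [
    ("Gemini-3.1-Pro (LLM)", 100.0, 50, 0.0),
    ("Claude-Opus-4.5 (LLM)", 80.0, 50, 5.7),
    ("Grok-3-beta (LLM)", 62.0, 50, 6.9),
    ("GPT-4o-2024-05-13 (VLM)", 62.0, 125, 4.3),
    ("GPT-4o-2024-05-13 (LLM)", 77.6, 125, 3.7),
]

for subject, score, episodes, listed in checks:
    computed = round(binomial_std_error(score, episodes), 1)
    print(f"{subject}: computed {computed}, listed {listed}")
```

Under these assumptions, overlapping uncertainty ranges between adjacent ranks suggest that small score gaps in the table should not be read as reliable differences.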