BabyAI
BabyAI: Measures embodied-agent, navigation, manipulation, or simulated robotics task success.
44 rows
Primary metric: babyai_score
Sampled: 2026-05-27
Metadata
Metrics
BabyAI score, BALROG progress, Crafter score, TextWorld score, BabaIsAI score, MiniHack score, NetHack score
| Rank | Subject | BabyAI score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini-3.1-Pro (LLM) | 100.0 ± 0.0 | — | Imported | 2026-05-27 |
| 2 | Gemini-3.1-Pro-Thinking (LLM) | 98.0 ± 2.0 | — | Imported | 2026-05-27 |
| 3 | Gemini-3-Pro (LLM) | 96.0 ± 2.8 | — | Imported | 2026-05-27 |
| 4 | Gemini-3-Flash (LLM) | 86.0 ± 4.9 | — | Imported | 2026-05-27 |
| 5 | Claude-3.5-Sonnet-2024-10-22 (VLM) | 82.0 ± 5.4 | — | Imported | 2026-05-27 |
| 6 | Claude-Opus-4.5 (LLM) | 80.0 ± 5.7 | — | Imported | 2026-05-27 |
| 7 | Gemini-2.5-Pro-Exp-03-25 (LLM) | 80.0 ± 5.7 | — | Imported | 2026-05-27 |
| 8 | GPT-5-minimal-think (LLM) | 80.0 ± 5.7 | — | Imported | 2026-05-27 |
| 9 | GPT-4o-2024-05-13 (LLM) | 77.6 ± 3.7 | — | Imported | 2026-05-27 |
| 10 | Grok-4 (LLM) | 76.0 ± 6.0 | — | Imported | 2026-05-27 |
| 11 | Reka-Flash-3 (LLM) | 76.0 ± 6.0 | — | Imported | 2026-05-27 |
| 12 | DeepSeek-R1 (LLM) | 74.0 ± 6.2 | — | Imported | 2026-05-27 |
| 13 | Gemini-2.5-Pro-Exp-03-25 (VLM) | 74.0 ± 6.2 | — | Imported | 2026-05-27 |
| 14 | Llama-3.1-70B-it (LLM) | 73.2 ± 4.0 | — | Imported | 2026-05-27 |
| 15 | Claude-Opus-4.5-Thinking (LLM) | 72.0 ± 6.3 | — | Imported | 2026-05-27 |
| 16 | Llama-3.2-90B-it (LLM) | 72.0 ± 6.3 | — | Imported | 2026-05-27 |
| 17 | Claude-3.5-Sonnet-2024-10-22 (LLM) | 68.0 ± 6.6 | — | Imported | 2026-05-27 |
| 18 | Gemini-2.5-Flash (LLM) | 68.0 ± 6.6 | — | Imported | 2026-05-27 |
| 19 | Claude-Haiku-4.5 (LLM) | 66.0 ± 6.7 | — | Imported | 2026-05-27 |
| 20 | Llama-3.2-90B-it (VLM) | 66.0 ± 6.7 | — | Imported | 2026-05-27 |
| 21 | Llama-3.3-70B-it (LLM) | 66.0 ± 6.7 | — | Imported | 2026-05-27 |
| 22 | GPT-4o-2024-05-13 (VLM) | 62.0 ± 4.3 | — | Imported | 2026-05-27 |
| 23 | Grok-3-beta (LLM) | 62.0 ± 6.9 | — | Imported | 2026-05-27 |
| 24 | Gemini-1.5-Pro-002 (LLM) | 58.4 ± 4.4 | — | Imported | 2026-05-27 |
| 25 | Gemini-1.5-Pro-002 (VLM) | 58.4 ± 4.4 | — | Imported | 2026-05-27 |
| 26 | Claude-3.5-Haiku-2024-10-22 (LLM) | 52.0 ± 7.1 | — | Imported | 2026-05-27 |
| 27 | GPT-4o-mini-2024-07-18 (LLM) | 50.4 ± 4.5 | — | Imported | 2026-05-27 |
| 28 | Gemini-1.5-Flash-002 (LLM) | 50.0 ± 7.1 | — | Imported | 2026-05-27 |
| 29 | Llama-3.2-11B-it (LLM) | 50.0 ± 7.1 | — | Imported | 2026-05-27 |
| 30 | Mistral-Nemo-it-2407 (LLM) | 50.0 ± 7.1 | — | Imported | 2026-05-27 |
| 31 | DeepSeek-R1-Distill-Qwen-32B (LLM) | 48.0 ± 7.1 | — | Imported | 2026-05-27 |
| 32 | Gemini-1.5-Flash-002 (VLM) | 43.2 ± 4.4 | — | Imported | 2026-05-27 |
| 33 | GPT-4o-mini-2024-07-18 (VLM) | 38.0 ± 4.3 | — | Imported | 2026-05-27 |
| 34 | Llama-3.1-8B-it (LLM) | 36.0 ± 6.8 | — | Imported | 2026-05-27 |
| 35 | Qwen2-VL-72B-it (VLM) | 34.0 ± 6.7 | — | Imported | 2026-05-27 |
| 36 | Qwen2.5-72B-it (LLM) | 34.0 ± 6.7 | — | Imported | 2026-05-27 |
| 37 | Microsoft-Phi-4 (LLM) | 32.0 ± 6.6 | — | Imported | 2026-05-27 |
| 38 | Qwen2-VL-72B-it (LLM) | 24.0 ± 6.0 | — | Imported | 2026-05-27 |
| 39 | Llama-3.2-3B-it (LLM) | 20.0 ± 5.7 | — | Imported | 2026-05-27 |
| 40 | Llama-3.2-11B-it (VLM) | 18.0 ± 5.4 | — | Imported | 2026-05-27 |
| 41 | Qwen-2.5-7B-it (LLM) | 14.0 ± 4.9 | — | Imported | 2026-05-27 |
| 42 | Llama-3.2-1B-it (LLM) | 8.0 ± 3.8 | — | Imported | 2026-05-27 |
| 43 | Qwen2-VL-7B-it (LLM) | 4.0 ± 2.8 | — | Imported | 2026-05-27 |
| 44 | Qwen2-VL-7B-it (VLM) | 2.0 ± 2.0 | — | Imported | 2026-05-27 |
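
Each score cell carries two numbers, and the page does not label the second one. The sketch below is one way to read the cells, assuming the first number is a success percentage and the second is a standard error of that percentage. The helper names `parse_score_cell` and `binomial_standard_error`, and the episode counts passed to them, are hypothetical and are not taken from the source.

```python
import math
import re


def parse_score_cell(cell: str) -> tuple[float, float]:
    """Split a cell such as '80.0 5.7' or '80.0 ± 5.7' into (score, spread)."""
    numbers = re.findall(r"\d+(?:\.\d+)?", cell)
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers in score cell, got: {cell!r}")
    return float(numbers[0]), float(numbers[1])


def binomial_standard_error(score_pct: float, episodes: int) -> float:
    """Standard error, in percentage points, of a success rate over `episodes`.

    Assumption: episodes are independent pass/fail trials. The episode
    count is a guess; the leaderboard page does not state it.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / episodes)


score, spread = parse_score_cell("80.0 5.7")
# Compare the listed spread with a binomial standard error for a guessed
# episode count; agreement supports, but does not prove, the assumption.
print(score, spread, round(binomial_standard_error(score, episodes=50), 1))
```

Under this reading, a guess of 50 episodes reproduces the listed 5.7 for the 80.0 rows, while the 77.6 row matches a guess of 125 episodes, so the rows may not all share one evaluation budget. Treat both counts as inferences from the numbers, not as documented facts.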