LLM-WikiRace
Benchmark for long-term planning and reasoning over real-world knowledge graphs where models navigate Wikipedia hyperlinks from a source page to a target page.
29 rows
Primary metric: medium_success
Sampled: 2026-05-06
Metadata
Metrics: Easy Success, Medium Success, Hard Success, Tokens / Step (lower is better)
| Rank | Subject | Medium Success | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 | 69.80 | — | Imported | 2026-05-06 |
| 2 | Gemini 3 | 66.00 | Gemini 3 (google-gemini-3) | Imported | 2026-05-06 |
| 3 | GPT-5 | 60.00 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06 |
| 4 | Claude Opus 4.6 | 56.70 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06 |
| 5 | Gemini 2.5 | 56.70 | — | Imported | 2026-05-06 |
| 6 | Claude Opus 4.5 | 56.00 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06 |
| 7 | DeepSeek R1 | 54.70 | R1 (deepseek-r1) | Imported | 2026-05-06 |
| 8 | Gemini 2.5 Flash | 53.00 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 9 | GPT-5.2 | 50.70 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06 |
| 10 | GPT-5 Mini | 46.00 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-06 |
| 11 | Kimi K2 | 45.30 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-06 |
| 12 | Grok 4.1-Fast | 44.70 | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-06 |
| 13 | Claude Sonnet 4.5 | 43.30 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06 |
| 14 | Gemini 2.0 Flash | 41.30 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-06 |
| 15 | LLaMA 3 70B | 39.30 | — | Imported | 2026-05-06 |
| 16 | Gemma 3 27B | 30.00 | Gemma 3 27B (google-gemma-3-27b-it) | Imported | 2026-05-06 |
| 17 | GPT-5 Nano | 24.70 | GPT-5 Nano (openai-gpt-5-nano) | Imported | 2026-05-06 |
| 18 | Gemma 3 12B | 22.70 | Gemma 3 12B (google-gemma-3-12b-it) | Imported | 2026-05-06 |
| 19 | Apertus 70B | 10.70 | — | Imported | 2026-05-06 |
| 20 | Mistral 7B | 10.00 | — | Imported | 2026-05-06 |
| 21 | LLaMA 3 8B | 9.30 | — | Imported | 2026-05-06 |
| 22 | Ministral 8B | 8.70 | — | Imported | 2026-05-06 |
| 23 | LLaDA-Inst. 8B | 4.70 | — | Imported | 2026-05-06 |
| 24 | Apertus 8B | 4.00 | — | Imported | 2026-05-06 |
| 25 | Dream-v0-Inst. 7B | 3.30 | — | Imported | 2026-05-06 |
| 26 | LLaMA 3 3B | 3.30 | — | Imported | 2026-05-06 |
| 27 | Gemma 3 4B | 2.70 | Gemma 3 4B (google-gemma-3-4b-it) | Imported | 2026-05-06 |
| 28 | Qwen 2.5-7B | 1.30 | — | Imported | 2026-05-06 |
| 29 | LLaMA 3 1B | 0.00 | — | Imported | 2026-05-06 |
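
For readers unfamiliar with the task, the navigation problem described at the top of this page (following hyperlinks from a source page to a target page) can be sketched on a toy graph. This is only an illustrative sketch, not the benchmark's actual harness: the graph `TOY_LINKS`, the functions `shortest_path`, `run_episode`, and `first_link_policy`, and the `max_steps` budget are all hypothetical names and assumptions. In a real evaluation the link-choosing policy would presumably be a language model, and the graph would be Wikipedia's.

```python
from collections import deque

# Hypothetical toy link graph: page title -> titles it links to.
TOY_LINKS = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Ethiopia": ["Africa", "Coffee"],
    "Caffeine": ["Chemistry", "Coffee"],
    "Africa": ["Nile", "Ethiopia"],
    "Chemistry": ["Caffeine"],
    "Nile": ["Africa"],
}


def shortest_path(links, source, target):
    """Breadth-first search: return the shortest page path, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def run_episode(links, source, target, choose_link, max_steps):
    """Play one race: the policy follows one outgoing link per step.

    Returns (success, path). The step budget is an assumption of this sketch.
    """
    page, path = source, [source]
    for _ in range(max_steps):
        if page == target:
            break
        options = links.get(page, [])
        if not options:
            break  # dead end: no outgoing links
        page = choose_link(page, options, target)
        path.append(page)
    return page == target, path


def first_link_policy(page, options, target):
    """Stand-in for a model: take the target if linked, else the first link."""
    return target if target in options else options[0]


if __name__ == "__main__":
    print(shortest_path(TOY_LINKS, "Coffee", "Nile"))
    print(run_episode(TOY_LINKS, "Coffee", "Nile", first_link_policy, max_steps=5))
```

A success rate like the ones in the table would then be the fraction of episodes for which `run_episode` returns `True`, under whatever step budget and difficulty split the benchmark actually defines.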