BALROG

BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games): an evaluation benchmark for agentic reasoning in game environments.

Rows: 16
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

Rank  Subject                    Score  Model Match                                                 Provenance  Sampled
1     Grok 4                     43.60  Grok 4 (x-ai-grok-4)                                        Imported    2026-05-06
2     Gemini 2.5 Pro (Jun 2025)  43.30  Gemini 2.5 Pro (google-gemini-2.5-pro)                      Imported    2026-05-06
3     DeepSeek R1                34.90  R1 (deepseek-r1)                                            Imported    2026-05-06
4     GPT-5.2                    32.80  GPT-5.2 (openai-gpt-5.2)                                    Imported    2026-05-06
5     Claude 3.7 Sonnet          32.60  Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet)             Imported    2026-05-06
6     GPT-4o                     32.30  GPT-4o (openai-gpt-4o)                                      Imported    2026-05-06
7     Grok-3 mini                29.50  Grok 3 Mini (x-ai-grok-3-mini)                              Imported    2026-05-06
8     Llama 3.1 405B             27.90  -                                                           Imported    2026-05-06
9     Llama 3.3 70B              23.00  Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct)  Imported    2026-05-06
10    Gemini 1.5 Flash           21.00  -                                                           Imported    2026-05-06
11    DeepSeek V3                19.50  DeepSeek V3 (deepseek-deepseek-chat)                        Imported    2026-05-06
12    Claude 3.5 Haiku           19.30  Claude 3.5 Haiku (anthropic-claude-3.5-haiku)               Imported    2026-05-06
13    Mistral Large              17.60  Mistral Large (mistralai-mistral-large)                     Imported    2026-05-06
14    gpt-4o-mini-2024-07-18     17.40  GPT-4o-mini (2024-07-18) (openai-gpt-4o-mini-2024-07-18)    Imported    2026-05-06
15    Qwen2.5-Max                16.20  -                                                           Imported    2026-05-06
16    Phi-4                      11.60  Phi 4 (microsoft-phi-4)                                     Imported    2026-05-06
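
As a minimal sketch (not part of the leaderboard's own tooling), the rows above can be held as Python records to check that ranks follow descending score, to list subjects with no model match, and to compute each subject's gap to the leader. The `Row` class and the helper names are hypothetical, and `None` stands in for an empty Model Match cell.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical container, not the site's schema)."""
    rank: int
    subject: str
    score: float
    model_id: Optional[str]  # None when the Model Match cell is empty


# Values copied from the rows listed above.
ROWS: List[Row] = [
    Row(1, "Grok 4", 43.60, "x-ai-grok-4"),
    Row(2, "Gemini 2.5 Pro (Jun 2025)", 43.30, "google-gemini-2.5-pro"),
    Row(3, "DeepSeek R1", 34.90, "deepseek-r1"),
    Row(4, "GPT-5.2", 32.80, "openai-gpt-5.2"),
    Row(5, "Claude 3.7 Sonnet", 32.60, "anthropic-claude-3.7-sonnet"),
    Row(6, "GPT-4o", 32.30, "openai-gpt-4o"),
    Row(7, "Grok-3 mini", 29.50, "x-ai-grok-3-mini"),
    Row(8, "Llama 3.1 405B", 27.90, None),
    Row(9, "Llama 3.3 70B", 23.00, "meta-llama-llama-3.3-70b-instruct"),
    Row(10, "Gemini 1.5 Flash", 21.00, None),
    Row(11, "DeepSeek V3", 19.50, "deepseek-deepseek-chat"),
    Row(12, "Claude 3.5 Haiku", 19.30, "anthropic-claude-3.5-haiku"),
    Row(13, "Mistral Large", 17.60, "mistralai-mistral-large"),
    Row(14, "gpt-4o-mini-2024-07-18", 17.40, "openai-gpt-4o-mini-2024-07-18"),
    Row(15, "Qwen2.5-Max", 16.20, None),
    Row(16, "Phi-4", 11.60, "microsoft-phi-4"),
]


def is_ranked_by_score(rows: List[Row]) -> bool:
    """True if scores never increase as rank increases."""
    scores = [r.score for r in sorted(rows, key=lambda r: r.rank)]
    return all(a >= b for a, b in zip(scores, scores[1:]))


def unmatched(rows: List[Row]) -> List[str]:
    """Subjects whose Model Match cell is empty, in rank order."""
    return [r.subject for r in sorted(rows, key=lambda r: r.rank) if r.model_id is None]


def gap_to_leader(rows: List[Row]) -> Dict[str, float]:
    """Score difference between the top row and each subject, rounded to 2 places."""
    top = max(r.score for r in rows)
    return {r.subject: round(top - r.score, 2) for r in rows}


if __name__ == "__main__":
    print("ranked by score:", is_ranked_by_score(ROWS))
    print("no model match:", unmatched(ROWS))
    print("gap for Phi-4:", gap_to_leader(ROWS)["Phi-4"])
```

Standard error is listed as a metric, but no per-row values appear in the rows above, so the sketch compares scores only and says nothing about whether adjacent ranks are statistically distinguishable.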