BBH

BIG-Bench Hard (BBH) is a benchmark of challenging tasks that require multi-step reasoning.

16 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

Rank  Subject                     Score  Model Match                                        Provenance  Sampled
1     DeepSeek V3                 87.50  DeepSeek V3 (deepseek-deepseek-chat)               Imported    2026-05-06
2     Llama 3.1 405B              82.90  -                                                  Imported    2026-05-06
3     Phi-3-medium-128k-instruct  81.40  -                                                  Imported    2026-05-06
4     Qwen 2.5 72B                79.80  Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct)  Imported    2026-05-06
5     Phi-3-small-8k-instruct     79.10  -                                                  Imported    2026-05-06
6     GPT-4.1                     75.12  GPT-4.1 (openai-gpt-4.1)                           Imported    2026-05-06
7     Phi-3-mini-4k-instruct      71.70  -                                                  Imported    2026-05-06
8     Llama-2-70b-hf              64.90  -                                                  Imported    2026-05-06
9     gpt-3.5-turbo-1106          61.59  GPT-3.5 Turbo (openai-gpt-3.5-turbo)               Imported    2026-05-06
10    Phi-4                       59.40  Phi 4 (microsoft-phi-4)                            Imported    2026-05-06
11    Llama-2-7b                  58.50  -                                                  Imported    2026-05-06
12    Mistral-7B-v0.1             56.10  -                                                  Imported    2026-05-06
13    gemma-7b                    55.10  -                                                  Imported    2026-05-06
14    Qwen 3 235B                 55     Qwen3 235B A22B (qwen-qwen3-235b-a22b)             Imported    2026-05-06
15    Yi-6B                       47.20  -                                                  Imported    2026-05-06
16    falcon-180B                 37.10  -                                                  Imported    2026-05-06
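
The ranks in the table follow the scores in descending order. As a minimal sketch of that ordering, the Python snippet below sorts a handful of rows copied from the table and numbers them from 1. The names `Row` and `rank_rows` are hypothetical and not part of any leaderboard tooling, and the page does not state how ties are broken, so the sketch simply keeps tied rows in their input order.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure for illustration)."""
    subject: str
    score: float


def rank_rows(rows: list[Row]) -> list[tuple[int, Row]]:
    """Return (rank, row) pairs, highest score first, ranks starting at 1."""
    # sorted() is stable, so rows with equal scores keep their input order.
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return list(enumerate(ordered, start=1))


# A few rows copied from the table, deliberately out of order.
sample = [
    Row("GPT-4.1", 75.12),
    Row("falcon-180B", 37.10),
    Row("DeepSeek V3", 87.50),
    Row("Llama 3.1 405B", 82.90),
]

ranked = rank_rows(sample)
for rank, row in ranked:
    print(f"{rank:>2}  {row.subject:<16} {row.score:.2f}")
```

The ranks printed here are positions within the four-row sample, not the leaderboard ranks, because the sample omits the rows in between.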