BIG-Bench Extra Hard

BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. The benchmark contains 23 tasks testing diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. It was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. In the benchmark's original evaluation, the best general-purpose model reached only 9.8% average accuracy and the best reasoning-specialized model 44.8%, leaving substantial room for improvement.
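The description above quotes an "average accuracy" across tasks without saying how per-task scores are aggregated. The sketch below is a minimal illustration, assuming the average is taken over per-task accuracies and comparing an arithmetic mean with a harmonic mean. The task names and numbers are hypothetical placeholders, not BBEH results, and `summarize` is an illustrative helper, not part of any benchmark tooling.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-task accuracies in percent (placeholders, NOT real BBEH results).
task_accuracy = {
    "task_a": 80.0,
    "task_b": 40.0,
    "task_c": 10.0,
}


def summarize(scores):
    """Return arithmetic and harmonic means of the per-task accuracies."""
    values = list(scores.values())
    return {
        "arithmetic": fmean(values),        # every task weighted equally
        "harmonic": harmonic_mean(values),  # pulled down sharply by weak tasks
    }


summary = summarize(task_accuracy)
print(summary)
```

With these placeholder values the arithmetic mean is about 43.3 and the harmonic mean about 21.8. A harmonic-style aggregate rewards models that are consistently competent across tasks rather than strong on only a few.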

Rows: 9
Primary metric: Score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject           Score  Model Match                                  Provenance     Sampled
1     Gemma 4 31B       0.74   Gemma 4 31B (google-gemma-4-31b-it)          Self-reported  2026-05-06
2     Gemma 4 26B-A4B   0.65   Gemma 4 26B A4B (google-gemma-4-26b-a4b-it)  Self-reported  2026-05-06
3     Gemma 4 E4B       0.33   -                                            Self-reported  2026-05-06
4     Gemma 4 E2B       0.22   -                                            Self-reported  2026-05-06
5     Gemma 3 27B       0.19   Gemma 3 27B (google-gemma-3-27b-it)          Self-reported  2026-05-06
6     Gemma 3 12B       0.16   Gemma 3 12B (google-gemma-3-12b-it)          Self-reported  2026-05-06
7     Gemini Diffusion  0.15   -                                            Self-reported  2026-05-06
8     Gemma 3 4B        0.11   Gemma 3 4B (google-gemma-3-4b-it)            Self-reported  2026-05-06
9     Gemma 3 1B        0.07   -                                            Self-reported  2026-05-06