BIG-Bench Extra Hard
BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each of the 23 tasks in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. The tasks test diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. BBEH was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. It left substantial room for improvement: the best models in the original evaluation reached only 9.8% to 44.8% average accuracy.
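The description reports a single aggregate accuracy over the benchmark's tasks but does not spell out how per-task results are combined. As a minimal sketch, assuming per-task accuracies are available as percentages, the snippet below shows two common aggregation choices. The task names and values are hypothetical placeholders, not BBEH results, and `aggregate` is an illustrative helper rather than part of any official evaluation code.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-task accuracies in percent (placeholders, not real BBEH results).
task_accuracy = {
    "task_a": 40.0,
    "task_b": 10.0,
    "task_c": 25.0,
}


def aggregate(acc: dict[str, float]) -> dict[str, float]:
    """Combine per-task accuracies into benchmark-level summaries."""
    values = list(acc.values())
    return {
        # Arithmetic mean: every task contributes equally.
        "arithmetic_mean": fmean(values),
        # Harmonic mean: dominated by the weakest tasks, so a model
        # cannot hide a very low score behind a few strong ones.
        "harmonic_mean": harmonic_mean(values),
    }


summary = aggregate(task_accuracy)
print(summary)
```

The harmonic mean is never larger than the arithmetic mean, so it is the stricter summary when task scores are uneven. Which aggregate a given leaderboard reports should be checked against its own documentation.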
9 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemma 4 31B | 0.74 | Gemma 4 31B (`google-gemma-4-31b-it`) | Self-reported | 2026-05-06 |
| 2 | Gemma 4 26B-A4B | 0.65 | Gemma 4 26B A4B (`google-gemma-4-26b-a4b-it`) | Self-reported | 2026-05-06 |
| 3 | Gemma 4 E4B | 0.33 | — | Self-reported | 2026-05-06 |
| 4 | Gemma 4 E2B | 0.22 | — | Self-reported | 2026-05-06 |
| 5 | Gemma 3 27B | 0.19 | Gemma 3 27B (`google-gemma-3-27b-it`) | Self-reported | 2026-05-06 |
| 6 | Gemma 3 12B | 0.16 | Gemma 3 12B (`google-gemma-3-12b-it`) | Self-reported | 2026-05-06 |
| 7 | Gemini Diffusion | 0.15 | — | Self-reported | 2026-05-06 |
| 8 | Gemma 3 4B | 0.11 | Gemma 3 4B (`google-gemma-3-4b-it`) | Self-reported | 2026-05-06 |
| 9 | Gemma 3 1B | 0.07 | — | Self-reported | 2026-05-06 |