BIG-Bench Extra Hard

BIG-Bench Extra Hard (BBEH) is a reasoning benchmark that replaces each of the 23 tasks in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. The tasks test diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. BBEH was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. In the original evaluation it left substantial room for improvement: the best general-purpose model reached 9.8% average accuracy and the best reasoning-specialized model reached 44.8%.

Rows: 9

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: big_bench_extra_hard
Category: Reasoning
Release: 2025-02-26

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Gemma 4 31B	0.74	Gemma 4 31B google-gemma-4-31b-it	Self-reported	2026-05-06
2	Gemma 4 26B-A4B	0.65	Gemma 4 26B A4B google-gemma-4-26b-a4b-it	Self-reported	2026-05-06
3	Gemma 4 E4B	0.33	—	Self-reported	2026-05-06
4	Gemma 4 E2B	0.22	—	Self-reported	2026-05-06
5	Gemma 3 27B	0.19	Gemma 3 27B google-gemma-3-27b-it	Self-reported	2026-05-06
6	Gemma 3 12B	0.16	Gemma 3 12B google-gemma-3-12b-it	Self-reported	2026-05-06
7	Gemini Diffusion	0.15	—	Self-reported	2026-05-06
8	Gemma 3 4B	0.11	Gemma 3 4B google-gemma-3-4b-it	Self-reported	2026-05-06
9	Gemma 3 1B	0.07	—	Self-reported	2026-05-06
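
The table above is tab-separated, with "—" marking rows that have no model match. A minimal Python sketch of how such rows could be loaded for further analysis follows; the `Entry` class, the `parse_leaderboard` helper, and the `SAMPLE` string are hypothetical names introduced for illustration (they are not part of any published tooling), and only three rows from the table are reproduced.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    rank: int
    subject: str
    score: float
    model_match: Optional[str]  # None when the table shows "—"
    provenance: str
    sampled: str


def parse_leaderboard(tsv_text: str) -> list[Entry]:
    """Parse tab-separated leaderboard text whose first line is the header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    entries = []
    for row in reader:
        match = row["Model Match"].strip()
        entries.append(
            Entry(
                rank=int(row["Rank"]),
                subject=row["Subject"],
                score=float(row["Score"]),
                model_match=None if match == "—" else match,
                provenance=row["Provenance"],
                sampled=row["Sampled"],
            )
        )
    return entries


# Three rows copied from the table above; the rest are omitted for brevity.
SAMPLE = (
    "Rank\tSubject\tScore\tModel Match\tProvenance\tSampled\n"
    "1\tGemma 4 31B\t0.74\tGemma 4 31B google-gemma-4-31b-it\tSelf-reported\t2026-05-06\n"
    "3\tGemma 4 E4B\t0.33\t—\tSelf-reported\t2026-05-06\n"
    "9\tGemma 3 1B\t0.07\t—\tSelf-reported\t2026-05-06\n"
)

entries = parse_leaderboard(SAMPLE)
unmatched = [e.subject for e in entries if e.model_match is None]
print(unmatched)  # ['Gemma 4 E4B', 'Gemma 3 1B']
```

Treating "—" as `None` keeps unmatched subjects easy to filter without special-casing the placeholder string later.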
