HiddenMath

Google DeepMind's internal mathematical reasoning benchmark. It uses novel problems that models have not encountered during training, so that scores reflect mathematical reasoning ability rather than memorization.

Rows: 13
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Gemini 2.0 Flash | 0.63 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Self-reported | 2026-05-06
2 | Gemma 3 27B | 0.60 | Gemma 3 27B (google-gemma-3-27b-it) | Self-reported | 2026-05-06
3 | Gemini 2.0 Flash-Lite | 0.55 | Gemini 2.0 Flash Lite (google-gemini-2.0-flash-lite-001) | Self-reported | 2026-05-06
4 | Gemma 3 12B | 0.55 | Gemma 3 12B (google-gemma-3-12b-it) | Self-reported | 2026-05-06
5 | Gemini 1.5 Pro | 0.52 | - | Self-reported | 2026-05-06
6 | Gemini 1.5 Flash | 0.47 | - | Self-reported | 2026-05-06
7 | Gemma 3 4B | 0.43 | Gemma 3 4B (google-gemma-3-4b-it) | Self-reported | 2026-05-06
8 | Gemma 3n E4B Instructed LiteRT Preview | 0.38 | Gemma 3n 4B (google-gemma-3n-e4b-it) | Self-reported | 2026-05-06
8 | Gemma 3n E4B Instructed | 0.38 | Gemma 3n 4B (google-gemma-3n-e4b-it) | Self-reported | 2026-05-06
10 | Gemini 1.5 Flash 8B | 0.33 | - | Self-reported | 2026-05-06
11 | Gemma 3n E2B Instructed | 0.28 | Gemma 3n 2B (google-gemma-3n-e2b-it) | Self-reported | 2026-05-06
11 | Gemma 3n E2B Instructed LiteRT (Preview) | 0.28 | Gemma 3n 2B (google-gemma-3n-e2b-it) | Self-reported | 2026-05-06
13 | Gemma 3 1B | 0.16 | - | Self-reported | 2026-05-06
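
The tied ranks in the results (8, 8, 10 and 11, 11, 13) look like standard competition ranking, where tied entries share the best rank and the next rank is skipped. The sketch below is an illustration only, assuming ranks are derived from the displayed two-decimal scores; the function name `competition_ranks` is hypothetical and not part of the page. Under that assumption the two 0.55 entries would also tie at rank 3, whereas the page lists them as 3 and 4, so the page presumably ranks on unrounded scores or a tie-breaker that is not shown.

```python
def competition_ranks(scores):
    """Return 1-based ranks where tied scores share the best rank (e.g. 1, 2, 2, 4)."""
    ordered = sorted(scores, reverse=True)
    # A score's rank is 1 + the number of strictly higher scores, which is
    # the 1-based position of its first occurrence in the descending list.
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Displayed scores from the results above, in listed order.
displayed = [0.63, 0.60, 0.55, 0.55, 0.52, 0.47, 0.43,
             0.38, 0.38, 0.33, 0.28, 0.28, 0.16]
ranks = competition_ranks(displayed)
print(ranks)
```

Comparing `ranks` with the Rank column shows agreement everywhere except the second 0.55 entry, which is the only place the displayed scores alone do not explain the listed order.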