GSM8K

Grade-school math word-problem benchmark for evaluating multi-step arithmetic and reasoning performance.

25 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

GSM8K score
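
This page does not state how the GSM8K score is computed. A minimal sketch, assuming the common convention of exact-match accuracy on the final numeric answer, expressed as a percentage: GSM8K reference solutions end with a `#### <number>` line, and a prediction counts as correct when its final number equals the reference. The function names `extract_final_answer` and `gsm8k_score` are hypothetical, and the fallback of taking the last number in a prediction that has no `####` marker is an assumption, not the leaderboard's documented method.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, allowing thousands separators (e.g. "1,000").
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker if present,
    otherwise the last number in the text (assumed fallback)."""
    marker = text.rfind("####")
    tail = text[marker + 4:] if marker != -1 else text
    numbers = NUMBER.findall(tail)
    if not numbers:
        return None
    picked = numbers[0] if marker != -1 else numbers[-1]
    return picked.replace(",", "")

def gsm8k_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose predicted final number equals the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = 0
    for pred, ref in zip(predictions, references):
        p = extract_final_answer(pred)
        r = extract_final_answer(ref)
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return 100.0 * correct / len(references)
```

Under this assumption, a score of 92.60 would mean that 92.60% of the evaluated problems were answered with the correct final number. Reported scores can differ across sources in prompt format, number of few-shot examples, and answer-extraction rules, so values imported from different model cards are not necessarily directly comparable.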

Latest Results

Rows are ordered by the rank returned by the Hugging Face leaderboard API. Model display names are preserved from the source modelId values.

| Rank | Subject | GSM8K score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | XiaomiMiMo/MiMo-V2.5-Pro | 99.60 | | Imported | 2026-05-06 |
| 2 | meta-llama/Llama-3.1-405B | 96.80 | | Imported | 2026-05-06 |
| 3 | ibm-granite/granite-4.1-30b | 94.16 | | Imported | 2026-05-06 |
| 4 | deepseek-ai/DeepSeek-V4-Pro | 92.60 | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-06 |
| 5 | ibm-granite/granite-4.1-8b | 92.49 | Granite 4.1 8B (ibm-granite-granite-4.1-8b) | Imported | 2026-05-06 |
| 6 | microsoft/Phi-3-medium-4k-instruct | 91 | | Imported | 2026-05-06 |
| 7 | prism-ml/Ternary-Bonsai-8B-mlx-2bit | 91 | | Imported | 2026-05-06 |
| 8 | prism-ml/Ternary-Bonsai-8B-gguf | 91 | | Imported | 2026-05-06 |
| 9 | prism-ml/Ternary-Bonsai-4B-mlx-2bit | 90.50 | | Imported | 2026-05-06 |
| 10 | prism-ml/Ternary-Bonsai-4B-gguf | 90.50 | | Imported | 2026-05-06 |
| 11 | Qwen/Qwen2-72B | 89.50 | | Imported | 2026-05-06 |
| 12 | deepseek-ai/DeepSeek-V3 | 89.30 | | Imported | 2026-05-06 |
| 13 | prism-ml/Bonsai-8B-gguf | 88 | | Imported | 2026-05-06 |
| 14 | prism-ml/Bonsai-8B-mlx-1bit | 88 | | Imported | 2026-05-06 |
| 15 | prism-ml/Bonsai-4B-gguf | 87.30 | | Imported | 2026-05-06 |
| 16 | ibm-granite/granite-4.1-3b | 86.88 | | Imported | 2026-05-06 |
| 17 | microsoft/Phi-3.5-mini-instruct | 86.20 | | Imported | 2026-05-06 |
| 18 | internlm/internlm2_5-7b-chat | 86 | | Imported | 2026-05-06 |
| 19 | microsoft/Phi-3-mini-4k-instruct | 85.70 | | Imported | 2026-05-06 |
| 20 | meta-llama/Llama-3.1-8B-Instruct | 84.50 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-06 |
| 21 | Qwen/Qwen2-7B | 79.90 | | Imported | 2026-05-06 |
| 22 | internlm/internlm2-chat-20b | 79.60 | | Imported | 2026-05-06 |
| 23 | deepseek-ai/DeepSeek-V2 | 79.20 | | Imported | 2026-05-06 |
| 24 | prism-ml/Ternary-Bonsai-1.7B-mlx-2bit | 74.20 | | Imported | 2026-05-06 |
| 25 | prism-ml/Ternary-Bonsai-1.7B-gguf | 74.20 | | Imported | 2026-05-06 |