HELM GSM8K

HELM GSM8K: Measures mathematical reasoning on grade-school math word problems that require multi-step arithmetic.

Rows: 69
Primary metric: exact_match
Sampled: 2026-05-27

Metadata

Metrics

Exact match, ECE (10-bin) (lower is better), Exact match (Robustness), Exact match (Fairness), Denoised inference time (s) (lower is better), # eval
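For readers unfamiliar with the first two metrics, the sketch below shows one common way to compute exact match and a 10-bin expected calibration error (ECE). It is an illustration only, not HELM's implementation: the function names, the whitespace-only normalization, and the equal-width binning are all assumptions made for this example.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference.

    Assumption: answers are compared after stripping surrounding whitespace only.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """ECE as the size-weighted mean of |accuracy - mean confidence| per bin.

    Assumption: equal-width bins over [0, 1]; other binning schemes exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    print(exact_match(["18", " 3 ", "7"], ["18", "3", "8"]))
    print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, False, False, False]))
```

Lower ECE means the model's stated confidence tracks its actual accuracy more closely, which is why the metric is marked "lower is better".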

Latest Results

Rows are ranked by Exact match, using the values from the aggregate HELM Classic gsm group table.

Rank Subject Exact match Model Match Provenance Sampled
1 Palmyra X (43B) 63.3% Imported 2026-05-27
2 code-davinci-002 56.766667% Imported 2026-05-27
3 gpt-3.5-turbo-0301 53.1% GPT-3.5 Turbo openai-gpt-3.5-turbo Imported 2026-05-27
4 text-davinci-003 50.6% Imported 2026-05-27
5 Llama 2 (70B) 48.4% Imported 2026-05-27
6 gpt-3.5-turbo-0613 46.9% GPT-3.5 Turbo openai-gpt-3.5-turbo Imported 2026-05-27
7 LLaMA (65B) 46.6% Imported 2026-05-27
8 text-davinci-002 41.533333% Imported 2026-05-27
9 Mistral v0.1 (7B) 38.1% Imported 2026-05-27
10 MPT-Instruct (30B) 34.4% Imported 2026-05-27
11 Falcon-Instruct (40B) 33.8% Imported 2026-05-27
12 LLaMA (30B) 32% Imported 2026-05-27
13 Falcon (40B) 25% Imported 2026-05-27
14 Llama 2 (13B) 24.5% Imported 2026-05-27
15 Vicuna v1.3 (13B) 22.6% Imported 2026-05-27
16 Jurassic-2 Jumbo (178B) 22.466667% Imported 2026-05-27
17 Anthropic-LM v4-s3 (52B) 17.1% Imported 2026-05-27
18 MPT (30B) 16.4% Imported 2026-05-27
19 LLaMA (13B) 15.4% Imported 2026-05-27
20 TNLG v2 (530B) 14.633333% Imported 2026-05-27
21 Cohere Command beta (52.4B) 13.8% Imported 2026-05-27
22 Vicuna v1.3 (7B) 13.4% Imported 2026-05-27
23 Jurassic-2 Grande (17B) 13.3% Imported 2026-05-27
24 Llama 2 (7B) 13.3% Imported 2026-05-27
25 Luminous Supreme (70B) 11.2% Imported 2026-05-27
26 Cohere xlarge v20221108 (52.4B) 9.966667% Imported 2026-05-27
27 J1-Grande v2 beta (17B) 9.6% Imported 2026-05-27
28 BLOOM (176B) 9.5% Imported 2026-05-27
29 davinci (175B) 9% Imported 2026-05-27
30 LLaMA (7B) 8% Imported 2026-05-27
31 Cohere xlarge v20220609 (52.4B) 7% Imported 2026-05-27
32 Luminous Extended (30B) 6.666667% Imported 2026-05-27
33 InstructPalmyra (30B) 6.333333% Imported 2026-05-27
34 GLM (130B) 6.1% Imported 2026-05-27
35 J1-Jumbo v1 (178B) 5.4% Imported 2026-05-27
36 J1-Grande v1 (17B) 5.366667% Imported 2026-05-27
37 GPT-NeoX (20B) 5.266667% Imported 2026-05-27
38 Falcon-Instruct (7B) 5.2% Imported 2026-05-27
39 code-cushman-001 (12B) 4.9% Imported 2026-05-27
40 OPT (175B) 4.033333% Imported 2026-05-27
41 Falcon (7B) 4% Imported 2026-05-27
42 Cohere Command beta (6.1B) 3.6% Imported 2026-05-27
43 GPT-J (6B) 3.6% Imported 2026-05-27
44 Pythia (12B) 3.2% Imported 2026-05-27
45 Jurassic-2 Large (7.5B) 3% Imported 2026-05-27
46 Luminous Base (13B) 2.6% Imported 2026-05-27
47 UL2 (20B) 2.366667% Imported 2026-05-27
48 T5 (11B) 2.333333% Imported 2026-05-27
49 RedPajama-INCITE-Base (7B) 2.1% Imported 2026-05-27
50 OPT (66B) 1.8% Imported 2026-05-27
51 TNLG v2 (6.7B) 1.8% Imported 2026-05-27
52 Cohere large v20220720 (13.1B) 1.766667% Imported 2026-05-27
53 Cohere medium v20221108 (6.1B) 1.733333% Imported 2026-05-27
54 RedPajama-INCITE-Instruct (7B) 1.6% Imported 2026-05-27
55 curie (6.7B) 1.566667% Imported 2026-05-27
56 Cohere medium v20220720 (6.1B) 1.466667% Imported 2026-05-27
57 Pythia (6.9B) 1.4% Imported 2026-05-27
58 J1-Large v1 (7.5B) 1.366667% Imported 2026-05-27
59 Alpaca (7B) 1.2% Imported 2026-05-27
60 RedPajama-INCITE-Instruct-v1 (3B) 1.1% Imported 2026-05-27
61 RedPajama-INCITE-Base-v1 (3B) 1% Imported 2026-05-27
62 babbage (1.3B) 0.666667% Imported 2026-05-27
63 ada (350M) 0.633333% Imported 2026-05-27
64 text-curie-001 0.6% Imported 2026-05-27
65 Cohere small v20220720 (410M) 0.4% Imported 2026-05-27
66 text-ada-001 0.4% Imported 2026-05-27
67 text-babbage-001 0.033333% Imported 2026-05-27
68 T0pp (11B) 0% Imported 2026-05-27
69 YaLM (100B) 0% Imported 2026-05-27
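
To work with these results programmatically, the rows can be split into fields with a regular expression. The following is a minimal sketch, assuming each row sits on a single line in the form "rank subject score% [model match] Imported date". The `Row` type, the `parse_row` helper, and the pattern itself are hypothetical and written for this example; they are not part of HELM or of this page's tooling.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    subject: str
    exact_match_pct: float
    model_match: str  # empty string when the column is blank
    sampled: str


# rank, subject, percentage, optional model match, literal "Imported", ISO date
ROW_RE = re.compile(
    r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)%\s*(.*?)\s*Imported\s+(\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Optional[Row]:
    """Parse one leaderboard line; return None if it does not fit the pattern."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    rank, subject, pct, model_match, sampled = m.groups()
    return Row(int(rank), subject, float(pct), model_match, sampled)


# Sample lines copied from the table above, deliberately out of order.
sample_lines = [
    "5 Llama 2 (70B) 48.4% Imported 2026-05-27",
    "1 Palmyra X (43B) 63.3% Imported 2026-05-27",
    "3 gpt-3.5-turbo-0301 53.1% GPT-3.5 Turbo openai-gpt-3.5-turbo Imported 2026-05-27",
]

rows = [r for r in map(parse_row, sample_lines) if r is not None]
# Re-rank by exact match, highest first, as the table does.
ranked = sorted(rows, key=lambda r: r.exact_match_pct, reverse=True)

if __name__ == "__main__":
    for r in ranked:
        print(r.rank, r.subject, r.exact_match_pct, r.model_match or "-")
```

Because the subject names themselves contain digits and parentheses, the pattern anchors on the percent sign and the literal "Imported" rather than on whitespace alone.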