HELM GSM8K
Measures mathematical reasoning on GSM8K, a benchmark of grade-school math word problems, as evaluated in HELM.
Rows: 69
Primary metric: exact_match
Sampled: 2026-05-27
Metadata
Metrics
Exact match, ECE (10-bin) (lower is better), Exact match (Robustness), Exact match (Fairness), Denoised inference time (s) (lower is better), # eval
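
To make the first two metrics listed above concrete, here is a minimal sketch of how exact match and a 10-bin expected calibration error could be computed. It rests on assumptions rather than on HELM's actual implementation: the function names `exact_match` and `expected_calibration_error` are hypothetical, answers are compared after trimming whitespace only, and the bins are equal-width over [0, 1]. HELM's own normalization and binning may differ.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after trimming whitespace.

    Assumption: whitespace trimming is the only normalization applied.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins.

    Assumption: equal-width bins on [0, 1]; lower values mean better calibration.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group (confidence, correctness) pairs by bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, the Exact match column in the table below would correspond to `exact_match(...)` expressed as a percentage, and a perfectly calibrated model would score an ECE of 0.
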
| Rank | Subject | Exact match | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Palmyra X (43B) | 63.3% | — | Imported | 2026-05-27 |
| 2 | code-davinci-002 | 56.766667% | — | Imported | 2026-05-27 |
| 3 | gpt-3.5-turbo-0301 | 53.1% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-27 |
| 4 | text-davinci-003 | 50.6% | — | Imported | 2026-05-27 |
| 5 | Llama 2 (70B) | 48.4% | — | Imported | 2026-05-27 |
| 6 | gpt-3.5-turbo-0613 | 46.9% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-27 |
| 7 | LLaMA (65B) | 46.6% | — | Imported | 2026-05-27 |
| 8 | text-davinci-002 | 41.533333% | — | Imported | 2026-05-27 |
| 9 | Mistral v0.1 (7B) | 38.1% | — | Imported | 2026-05-27 |
| 10 | MPT-Instruct (30B) | 34.4% | — | Imported | 2026-05-27 |
| 11 | Falcon-Instruct (40B) | 33.8% | — | Imported | 2026-05-27 |
| 12 | LLaMA (30B) | 32% | — | Imported | 2026-05-27 |
| 13 | Falcon (40B) | 25% | — | Imported | 2026-05-27 |
| 14 | Llama 2 (13B) | 24.5% | — | Imported | 2026-05-27 |
| 15 | Vicuna v1.3 (13B) | 22.6% | — | Imported | 2026-05-27 |
| 16 | Jurassic-2 Jumbo (178B) | 22.466667% | — | Imported | 2026-05-27 |
| 17 | Anthropic-LM v4-s3 (52B) | 17.1% | — | Imported | 2026-05-27 |
| 18 | MPT (30B) | 16.4% | — | Imported | 2026-05-27 |
| 19 | LLaMA (13B) | 15.4% | — | Imported | 2026-05-27 |
| 20 | TNLG v2 (530B) | 14.633333% | — | Imported | 2026-05-27 |
| 21 | Cohere Command beta (52.4B) | 13.8% | — | Imported | 2026-05-27 |
| 22 | Vicuna v1.3 (7B) | 13.4% | — | Imported | 2026-05-27 |
| 23 | Jurassic-2 Grande (17B) | 13.3% | — | Imported | 2026-05-27 |
| 24 | Llama 2 (7B) | 13.3% | — | Imported | 2026-05-27 |
| 25 | Luminous Supreme (70B) | 11.2% | — | Imported | 2026-05-27 |
| 26 | Cohere xlarge v20221108 (52.4B) | 9.966667% | — | Imported | 2026-05-27 |
| 27 | J1-Grande v2 beta (17B) | 9.6% | — | Imported | 2026-05-27 |
| 28 | BLOOM (176B) | 9.5% | — | Imported | 2026-05-27 |
| 29 | davinci (175B) | 9% | — | Imported | 2026-05-27 |
| 30 | LLaMA (7B) | 8% | — | Imported | 2026-05-27 |
| 31 | Cohere xlarge v20220609 (52.4B) | 7% | — | Imported | 2026-05-27 |
| 32 | Luminous Extended (30B) | 6.666667% | — | Imported | 2026-05-27 |
| 33 | InstructPalmyra (30B) | 6.333333% | — | Imported | 2026-05-27 |
| 34 | GLM (130B) | 6.1% | — | Imported | 2026-05-27 |
| 35 | J1-Jumbo v1 (178B) | 5.4% | — | Imported | 2026-05-27 |
| 36 | J1-Grande v1 (17B) | 5.366667% | — | Imported | 2026-05-27 |
| 37 | GPT-NeoX (20B) | 5.266667% | — | Imported | 2026-05-27 |
| 38 | Falcon-Instruct (7B) | 5.2% | — | Imported | 2026-05-27 |
| 39 | code-cushman-001 (12B) | 4.9% | — | Imported | 2026-05-27 |
| 40 | OPT (175B) | 4.033333% | — | Imported | 2026-05-27 |
| 41 | Falcon (7B) | 4% | — | Imported | 2026-05-27 |
| 42 | Cohere Command beta (6.1B) | 3.6% | — | Imported | 2026-05-27 |
| 43 | GPT-J (6B) | 3.6% | — | Imported | 2026-05-27 |
| 44 | Pythia (12B) | 3.2% | — | Imported | 2026-05-27 |
| 45 | Jurassic-2 Large (7.5B) | 3% | — | Imported | 2026-05-27 |
| 46 | Luminous Base (13B) | 2.6% | — | Imported | 2026-05-27 |
| 47 | UL2 (20B) | 2.366667% | — | Imported | 2026-05-27 |
| 48 | T5 (11B) | 2.333333% | — | Imported | 2026-05-27 |
| 49 | RedPajama-INCITE-Base (7B) | 2.1% | — | Imported | 2026-05-27 |
| 50 | OPT (66B) | 1.8% | — | Imported | 2026-05-27 |
| 51 | TNLG v2 (6.7B) | 1.8% | — | Imported | 2026-05-27 |
| 52 | Cohere large v20220720 (13.1B) | 1.766667% | — | Imported | 2026-05-27 |
| 53 | Cohere medium v20221108 (6.1B) | 1.733333% | — | Imported | 2026-05-27 |
| 54 | RedPajama-INCITE-Instruct (7B) | 1.6% | — | Imported | 2026-05-27 |
| 55 | curie (6.7B) | 1.566667% | — | Imported | 2026-05-27 |
| 56 | Cohere medium v20220720 (6.1B) | 1.466667% | — | Imported | 2026-05-27 |
| 57 | Pythia (6.9B) | 1.4% | — | Imported | 2026-05-27 |
| 58 | J1-Large v1 (7.5B) | 1.366667% | — | Imported | 2026-05-27 |
| 59 | Alpaca (7B) | 1.2% | — | Imported | 2026-05-27 |
| 60 | RedPajama-INCITE-Instruct-v1 (3B) | 1.1% | — | Imported | 2026-05-27 |
| 61 | RedPajama-INCITE-Base-v1 (3B) | 1% | — | Imported | 2026-05-27 |
| 62 | babbage (1.3B) | 0.666667% | — | Imported | 2026-05-27 |
| 63 | ada (350M) | 0.633333% | — | Imported | 2026-05-27 |
| 64 | text-curie-001 | 0.6% | — | Imported | 2026-05-27 |
| 65 | Cohere small v20220720 (410M) | 0.4% | — | Imported | 2026-05-27 |
| 66 | text-ada-001 | 0.4% | — | Imported | 2026-05-27 |
| 67 | text-babbage-001 | 0.033333% | — | Imported | 2026-05-27 |
| 68 | T0pp (11B) | 0% | — | Imported | 2026-05-27 |
| 69 | YaLM (100B) | 0% | — | Imported | 2026-05-27 |