HELM MMLU
HELM Classic MMLU group leaderboard from the public HELM release artifacts.
Rows: 67
Primary metric: exact_match
Sampled: 2026-05-06
Metadata
Metrics
Exact match, ECE (10-bin) (lower is better), Exact match (robustness), Exact match (fairness), Denoised inference time (s) (lower is better), # eval
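As an illustration of the "ECE (10-bin)" metric listed above, here is a minimal Python sketch of expected calibration error. It assumes equal-width confidence bins, which may differ from the binning HELM itself uses; the function name `expected_calibration_error` and its parameters are hypothetical and do not come from the HELM codebase.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Expected calibration error with equal-width bins (an assumption).

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: whether the chosen answer was right for each example.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0.0] * n_bins   # number of correct predictions per bin
    count = [0] * n_bins       # number of examples per bin

    for c, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    # Weighted mean of |accuracy - mean confidence| over non-empty bins;
    # (count/n) * |hits/count - conf/count| simplifies to |hits - conf| / n.
    return sum(
        abs(hit_sum[i] - conf_sum[i]) / n for i in range(n_bins) if count[i]
    )
```

A lower value means the model's stated confidence tracks its actual accuracy more closely, which is why the metric is marked "lower is better".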
| Rank | Subject | Exact match | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Palmyra X (43B) | 60.91 | — | Imported | 2026-05-06 |
| 2 | gpt-3.5-turbo-0301 | 58.98 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-06 |
| 3 | LLaMA (65B) | 58.37 | — | Imported | 2026-05-06 |
| 4 | Llama 2 (70B) | 58.17 | — | Imported | 2026-05-06 |
| 5 | Mistral v0.1 (7B) | 57.22 | — | Imported | 2026-05-06 |
| 6 | text-davinci-003 | 56.88 | — | Imported | 2026-05-06 |
| 7 | text-davinci-002 | 56.76 | — | Imported | 2026-05-06 |
| 8 | LLaMA (30B) | 53.14 | — | Imported | 2026-05-06 |
| 9 | Falcon (40B) | 50.89 | — | Imported | 2026-05-06 |
| 10 | Llama 2 (13B) | 50.67 | — | Imported | 2026-05-06 |
| 11 | Falcon-Instruct (40B) | 49.66 | — | Imported | 2026-05-06 |
| 12 | Anthropic-LM v4-s3 (52B) | 48.13 | — | Imported | 2026-05-06 |
| 13 | Jurassic-2 Jumbo (178B) | 48.05 | — | Imported | 2026-05-06 |
| 14 | Jurassic-2 Grande (17B) | 47.53 | — | Imported | 2026-05-06 |
| 15 | TNLG v2 (530B) | 46.90 | — | Imported | 2026-05-06 |
| 16 | Vicuna v1.3 (13B) | 46.16 | — | Imported | 2026-05-06 |
| 17 | Cohere Command beta (52.4B) | 45.24 | — | Imported | 2026-05-06 |
| 18 | J1-Grande v2 beta (17B) | 44.51 | — | Imported | 2026-05-06 |
| 19 | MPT-Instruct (30B) | 44.44 | — | Imported | 2026-05-06 |
| 20 | MPT (30B) | 43.67 | — | Imported | 2026-05-06 |
| 21 | Vicuna v1.3 (7B) | 43.36 | — | Imported | 2026-05-06 |
| 22 | Llama 2 (7B) | 43.07 | — | Imported | 2026-05-06 |
| 23 | davinci (175B) | 42.24 | — | Imported | 2026-05-06 |
| 24 | LLaMA (13B) | 42.21 | — | Imported | 2026-05-06 |
| 25 | T0pp (11B) | 40.65 | — | Imported | 2026-05-06 |
| 26 | Cohere Command beta (6.1B) | 40.63 | — | Imported | 2026-05-06 |
| 27 | InstructPalmyra (30B) | 40.27 | — | Imported | 2026-05-06 |
| 28 | gpt-3.5-turbo-0613 | 39.09 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-06 |
| 29 | Alpaca (7B) | 38.46 | — | Imported | 2026-05-06 |
| 30 | Cohere xlarge v20221108 (52.4B) | 38.20 | — | Imported | 2026-05-06 |
| 31 | Luminous Supreme (70B) | 38.01 | — | Imported | 2026-05-06 |
| 32 | RedPajama-INCITE-Instruct (7B) | 36.31 | — | Imported | 2026-05-06 |
| 33 | Cohere xlarge v20220609 (52.4B) | 35.30 | — | Imported | 2026-05-06 |
| 34 | GLM (130B) | 34.40 | — | Imported | 2026-05-06 |
| 35 | Jurassic-2 Large (7.5B) | 33.85 | — | Imported | 2026-05-06 |
| 36 | Cohere large v20220720 (13.1B) | 32.36 | — | Imported | 2026-05-06 |
| 37 | Luminous Extended (30B) | 32.07 | — | Imported | 2026-05-06 |
| 38 | LLaMA (7B) | 32.06 | — | Imported | 2026-05-06 |
| 39 | OPT (175B) | 31.84 | — | Imported | 2026-05-06 |
| 40 | RedPajama-INCITE-Base (7B) | 30.16 | — | Imported | 2026-05-06 |
| 41 | BLOOM (176B) | 29.87 | — | Imported | 2026-05-06 |
| 42 | UL2 (20B) | 29.12 | — | Imported | 2026-05-06 |
| 43 | T5 (11B) | 29.03 | — | Imported | 2026-05-06 |
| 44 | Falcon (7B) | 28.64 | — | Imported | 2026-05-06 |
| 45 | Cohere medium v20220720 (6.1B) | 27.88 | — | Imported | 2026-05-06 |
| 46 | GPT-NeoX (20B) | 27.64 | — | Imported | 2026-05-06 |
| 47 | OPT (66B) | 27.60 | — | Imported | 2026-05-06 |
| 48 | Falcon-Instruct (7B) | 27.49 | — | Imported | 2026-05-06 |
| 49 | Pythia (12B) | 27.36 | — | Imported | 2026-05-06 |
| 50 | J1-Grande v1 (17B) | 26.98 | — | Imported | 2026-05-06 |
| 51 | Luminous Base (13B) | 26.97 | — | Imported | 2026-05-06 |
| 52 | Cohere small v20220720 (410M) | 26.42 | — | Imported | 2026-05-06 |
| 53 | RedPajama-INCITE-Base-v1 (3B) | 26.29 | — | Imported | 2026-05-06 |
| 54 | J1-Jumbo v1 (178B) | 25.94 | — | Imported | 2026-05-06 |
| 55 | RedPajama-INCITE-Instruct-v1 (3B) | 25.74 | — | Imported | 2026-05-06 |
| 56 | Cohere medium v20221108 (6.1B) | 25.37 | — | Imported | 2026-05-06 |
| 57 | GPT-J (6B) | 24.85 | — | Imported | 2026-05-06 |
| 58 | YaLM (100B) | 24.34 | — | Imported | 2026-05-06 |
| 59 | curie (6.7B) | 24.28 | — | Imported | 2026-05-06 |
| 60 | ada (350M) | 24.28 | — | Imported | 2026-05-06 |
| 61 | TNLG v2 (6.7B) | 24.18 | — | Imported | 2026-05-06 |
| 62 | J1-Large v1 (7.5B) | 24.11 | — | Imported | 2026-05-06 |
| 63 | text-ada-001 | 23.77 | — | Imported | 2026-05-06 |
| 64 | text-curie-001 | 23.72 | — | Imported | 2026-05-06 |
| 65 | Pythia (6.9B) | 23.61 | — | Imported | 2026-05-06 |
| 66 | babbage (1.3B) | 23.45 | — | Imported | 2026-05-06 |
| 67 | text-babbage-001 | 22.87 | — | Imported | 2026-05-06 |