HELM MMLU

HELM Classic MMLU group leaderboard from the public HELM release artifacts.

Rows: 67
Primary metric: exact_match
Sampled: 2026-05-06

Metadata

Metrics

Exact match, ECE (10-bin) (lower is better), Exact match (robustness), Exact match (fairness), Denoised inference time (s) (lower is better), # eval
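As a rough illustration of the first two metrics listed above, the sketch below computes exact match and a 10-bin expected calibration error. It is not the HELM implementation. The function names, the whitespace trimming, and the use of equal-width confidence bins are all assumptions made for this example; HELM's own binning and string normalization may differ.

```python
from typing import List, Sequence, Tuple


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one example")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """ECE with equal-width bins (an assumption of this sketch).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    share of examples that fall into it. Lower is better.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one example")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

For example, `exact_match(["A", "B ", "C"], ["A", "B", "D"])` counts two of three answers as correct, and a model that is always right at the confidence it reports would score an ECE of zero.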

Latest Results

Rows are ranked in descending order of exact match, taken from the aggregate HELM group table. Scores are shown as percentages.

Rank Subject Exact match Model Match Provenance Sampled
1 Palmyra X (43B) 60.91 Imported 2026-05-06
2 gpt-3.5-turbo-0301 58.98 GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-06
3 LLaMA (65B) 58.37 Imported 2026-05-06
4 Llama 2 (70B) 58.17 Imported 2026-05-06
5 Mistral v0.1 (7B) 57.22 Imported 2026-05-06
6 text-davinci-003 56.88 Imported 2026-05-06
7 text-davinci-002 56.76 Imported 2026-05-06
8 LLaMA (30B) 53.14 Imported 2026-05-06
9 Falcon (40B) 50.89 Imported 2026-05-06
10 Llama 2 (13B) 50.67 Imported 2026-05-06
11 Falcon-Instruct (40B) 49.66 Imported 2026-05-06
12 Anthropic-LM v4-s3 (52B) 48.13 Imported 2026-05-06
13 Jurassic-2 Jumbo (178B) 48.05 Imported 2026-05-06
14 Jurassic-2 Grande (17B) 47.53 Imported 2026-05-06
15 TNLG v2 (530B) 46.90 Imported 2026-05-06
16 Vicuna v1.3 (13B) 46.16 Imported 2026-05-06
17 Cohere Command beta (52.4B) 45.24 Imported 2026-05-06
18 J1-Grande v2 beta (17B) 44.51 Imported 2026-05-06
19 MPT-Instruct (30B) 44.44 Imported 2026-05-06
20 MPT (30B) 43.67 Imported 2026-05-06
21 Vicuna v1.3 (7B) 43.36 Imported 2026-05-06
22 Llama 2 (7B) 43.07 Imported 2026-05-06
23 davinci (175B) 42.24 Imported 2026-05-06
24 LLaMA (13B) 42.21 Imported 2026-05-06
25 T0pp (11B) 40.65 Imported 2026-05-06
26 Cohere Command beta (6.1B) 40.63 Imported 2026-05-06
27 InstructPalmyra (30B) 40.27 Imported 2026-05-06
28 gpt-3.5-turbo-0613 39.09 GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-06
29 Alpaca (7B) 38.46 Imported 2026-05-06
30 Cohere xlarge v20221108 (52.4B) 38.20 Imported 2026-05-06
31 Luminous Supreme (70B) 38.01 Imported 2026-05-06
32 RedPajama-INCITE-Instruct (7B) 36.31 Imported 2026-05-06
33 Cohere xlarge v20220609 (52.4B) 35.30 Imported 2026-05-06
34 GLM (130B) 34.40 Imported 2026-05-06
35 Jurassic-2 Large (7.5B) 33.85 Imported 2026-05-06
36 Cohere large v20220720 (13.1B) 32.36 Imported 2026-05-06
37 Luminous Extended (30B) 32.07 Imported 2026-05-06
38 LLaMA (7B) 32.06 Imported 2026-05-06
39 OPT (175B) 31.84 Imported 2026-05-06
40 RedPajama-INCITE-Base (7B) 30.16 Imported 2026-05-06
41 BLOOM (176B) 29.87 Imported 2026-05-06
42 UL2 (20B) 29.12 Imported 2026-05-06
43 T5 (11B) 29.03 Imported 2026-05-06
44 Falcon (7B) 28.64 Imported 2026-05-06
45 Cohere medium v20220720 (6.1B) 27.88 Imported 2026-05-06
46 GPT-NeoX (20B) 27.64 Imported 2026-05-06
47 OPT (66B) 27.60 Imported 2026-05-06
48 Falcon-Instruct (7B) 27.49 Imported 2026-05-06
49 Pythia (12B) 27.36 Imported 2026-05-06
50 J1-Grande v1 (17B) 26.98 Imported 2026-05-06
51 Luminous Base (13B) 26.97 Imported 2026-05-06
52 Cohere small v20220720 (410M) 26.42 Imported 2026-05-06
53 RedPajama-INCITE-Base-v1 (3B) 26.29 Imported 2026-05-06
54 J1-Jumbo v1 (178B) 25.94 Imported 2026-05-06
55 RedPajama-INCITE-Instruct-v1 (3B) 25.74 Imported 2026-05-06
56 Cohere medium v20221108 (6.1B) 25.37 Imported 2026-05-06
57 GPT-J (6B) 24.85 Imported 2026-05-06
58 YaLM (100B) 24.34 Imported 2026-05-06
59 curie (6.7B) 24.28 Imported 2026-05-06
60 ada (350M) 24.28 Imported 2026-05-06
61 TNLG v2 (6.7B) 24.18 Imported 2026-05-06
62 J1-Large v1 (7.5B) 24.11 Imported 2026-05-06
63 text-ada-001 23.77 Imported 2026-05-06
64 text-curie-001 23.72 Imported 2026-05-06
65 Pythia (6.9B) 23.61 Imported 2026-05-06
66 babbage (1.3B) 23.45 Imported 2026-05-06
67 text-babbage-001 22.87 Imported 2026-05-06
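
The ranking used in the table above can be sketched as a descending sort on exact match. This is a minimal illustration, not the site's actual import code: the `Row` type and `rank_rows` function are hypothetical names, and the tie handling (sequential ranks in input order, as with the two 24.28 rows) is an assumption inferred from the table.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Row(NamedTuple):
    subject: str
    exact_match: float  # percentage, as shown in the table


def rank_rows(rows: Sequence[Row]) -> List[Tuple[int, str, float]]:
    """Sort rows by exact match, highest first, and number them from 1.

    Python's sort is stable, so rows with equal scores keep their
    input order and still receive distinct, consecutive ranks.
    """
    ordered = sorted(rows, key=lambda r: r.exact_match, reverse=True)
    return [(rank, r.subject, r.exact_match) for rank, r in enumerate(ordered, start=1)]


# A handful of rows copied from the table, deliberately out of order.
sample = [
    Row("Llama 2 (70B)", 58.17),
    Row("Palmyra X (43B)", 60.91),
    Row("curie (6.7B)", 24.28),
    Row("ada (350M)", 24.28),
    Row("LLaMA (65B)", 58.37),
]

for rank, subject, score in rank_rows(sample):
    print(f"{rank} {subject} {score:.2f}")
```

Assigning a shared rank to tied scores would be an equally reasonable design; the sequential numbering here simply mirrors what the table shows.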