HELM

HELM evaluates language models on broad knowledge, reasoning, commonsense, instruction following, and exam-style accuracy.

Rows: 67
Primary metric: mean_win_rate
Sampled: 2026-05-27

Metadata

Metrics

Mean win rate, MMLU Exact Match, BoolQ Exact Match, NarrativeQA F1, NaturalQuestions closed-book F1, NaturalQuestions open-book F1
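As a rough illustration of the two per-example metric families named above, the sketch below computes an exact-match score and a token-level F1 score for a single prediction. The normalization (lowercasing, stripping punctuation, whitespace tokenization) is an assumption made for this example; HELM's own normalization rules may differ, and the function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace.

    This normalization is an assumption for illustration only.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the multiset overlap."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A benchmark-level score would then be the mean of these per-example values over the evaluation set.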

Latest Results

Rows are ranked by HELM Classic mean win rate from the aggregate core scenarios group table.
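For readers unfamiliar with the ranking metric, the sketch below shows one common way to compute a mean win rate: for each scenario, take the fraction of other models that a model outscores, then average those fractions across scenarios. The model names and scores are hypothetical, and the tie handling (a tie counts as no win) is an assumption; the exact aggregation used for the HELM Classic table may differ.

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Compute a mean win rate from scores[model][scenario] (higher is better).

    Assumption: ties count as no win, and a scenario is skipped when
    fewer than two models report a score for it.
    """
    win_rates: dict[str, list[float]] = {model: [] for model in scores}
    scenarios = {s for per_model in scores.values() for s in per_model}
    for scenario in scenarios:
        # Only models that report this scenario take part in the comparison.
        entrants = {m: v[scenario] for m, v in scores.items() if scenario in v}
        if len(entrants) < 2:
            continue
        for model, score in entrants.items():
            wins = sum(
                score > other for name, other in entrants.items() if name != model
            )
            win_rates[model].append(wins / (len(entrants) - 1))
    return {m: sum(r) / len(r) for m, r in win_rates.items() if r}


# Hypothetical scores, not taken from the leaderboard.
example = {
    "model_a": {"mmlu": 0.70, "boolq": 0.90},
    "model_b": {"mmlu": 0.50, "boolq": 0.80},
    "model_c": {"mmlu": 0.60, "boolq": 0.95},
}
result = mean_win_rate(example)
ranking = sorted(result, key=result.get, reverse=True)
```

Because the score is relative to the other models in the table, adding or removing a model can change every other model's mean win rate.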

Rank Subject Mean win rate Model Match Provenance Sampled
1 Llama 2 (70B) 94.351981% Imported 2026-05-27
2 LLaMA (65B) 90.825175% Imported 2026-05-27
3 text-davinci-002 90.502788% Imported 2026-05-27
4 Mistral v0.1 (7B) 88.403263% Imported 2026-05-27
5 Cohere Command beta (52.4B) 87.449068% Imported 2026-05-27
6 text-davinci-003 87.159957% Imported 2026-05-27
7 Jurassic-2 Jumbo (178B) 82.438267% Imported 2026-05-27
8 Llama 2 (13B) 82.300699% Imported 2026-05-27
9 TNLG v2 (530B) 78.65258% Imported 2026-05-27
10 gpt-3.5-turbo-0613 78.296037% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
11 LLaMA (30B) 78.128205% Imported 2026-05-27
12 Anthropic-LM v4-s3 (52B) 78.037742% Imported 2026-05-27
13 gpt-3.5-turbo-0301 76.025641% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
14 Jurassic-2 Grande (17B) 74.324689% Imported 2026-05-27
15 Palmyra X (43B) 73.246451% Imported 2026-05-27
16 Falcon (40B) 72.939394% Imported 2026-05-27
17 Falcon-Instruct (40B) 72.655012% Imported 2026-05-27
18 MPT-Instruct (30B) 71.638695% Imported 2026-05-27
19 MPT (30B) 71.449883% Imported 2026-05-27
20 J1-Grande v2 beta (17B) 70.639592% Imported 2026-05-27
21 Vicuna v1.3 (13B) 70.631702% Imported 2026-05-27
22 Cohere Command beta (6.1B) 67.521002% Imported 2026-05-27
23 Cohere xlarge v20221108 (52.4B) 66.395092% Imported 2026-05-27
24 Luminous Supreme (70B) 66.159188% Imported 2026-05-27
25 Vicuna v1.3 (7B) 62.526807% Imported 2026-05-27
26 OPT (175B) 60.945576% Imported 2026-05-27
27 Llama 2 (7B) 60.731935% Imported 2026-05-27
28 LLaMA (13B) 59.468531% Imported 2026-05-27
29 InstructPalmyra (30B) 56.845377% Imported 2026-05-27
30 Cohere xlarge v20220609 (52.4B) 55.954648% Imported 2026-05-27
31 Jurassic-2 Large (7.5B) 55.296191% Imported 2026-05-27
32 davinci (175B) 53.770035% Imported 2026-05-27
33 LLaMA (7B) 53.268065% Imported 2026-05-27
34 RedPajama-INCITE-Instruct (7B) 52.424242% Imported 2026-05-27
35 J1-Jumbo v1 (178B) 51.650298% Imported 2026-05-27
36 GLM (130B) 51.212121% Imported 2026-05-27
37 Luminous Extended (30B) 48.50136% Imported 2026-05-27
38 OPT (66B) 44.801021% Imported 2026-05-27
39 BLOOM (176B) 44.607322% Imported 2026-05-27
40 J1-Grande v1 (17B) 43.263522% Imported 2026-05-27
41 Alpaca (7B) 38.088578% Imported 2026-05-27
42 Falcon (7B) 37.834499% Imported 2026-05-27
43 RedPajama-INCITE-Base (7B) 37.806527% Imported 2026-05-27
44 Cohere large v20220720 (13.1B) 37.183516% Imported 2026-05-27
45 RedPajama-INCITE-Instruct-v1 (3B) 36.608392% Imported 2026-05-27
46 text-curie-001 35.974581% Imported 2026-05-27
47 GPT-NeoX (20B) 35.097193% Imported 2026-05-27
48 Luminous Base (13B) 31.543318% Imported 2026-05-27
49 Cohere medium v20221108 (6.1B) 31.207175% Imported 2026-05-27
50 RedPajama-INCITE-Base-v1 (3B) 31.081585% Imported 2026-05-27
51 TNLG v2 (6.7B) 30.921482% Imported 2026-05-27
52 J1-Large v1 (7.5B) 28.522344% Imported 2026-05-27
53 GPT-J (6B) 27.275385% Imported 2026-05-27
54 Pythia (12B) 25.678322% Imported 2026-05-27
55 curie (6.7B) 24.737934% Imported 2026-05-27
56 Falcon-Instruct (7B) 24.405594% Imported 2026-05-27
57 Cohere medium v20220720 (6.1B) 22.967173% Imported 2026-05-27
58 text-babbage-001 22.864976% Imported 2026-05-27
59 T0pp (11B) 19.708625% Imported 2026-05-27
60 Pythia (6.9B) 19.559441% Imported 2026-05-27
61 UL2 (20B) 16.721542% Imported 2026-05-27
62 T5 (11B) 13.136169% Imported 2026-05-27
63 babbage (1.3B) 11.400428% Imported 2026-05-27
64 Cohere small v20220720 (410M) 10.872046% Imported 2026-05-27
65 ada (350M) 10.83284% Imported 2026-05-27
66 text-ada-001 10.733701% Imported 2026-05-27
67 YaLM (100B) 7.453866% Imported 2026-05-27