HELM | BenchmarkList

Metadata

ID: helm
Category: Intelligence
Release: Unknown

Metrics

Mean win rate, MMLU Exact Match, BoolQ Exact Match, NarrativeQA F1, NaturalQuestions closed-book F1, NaturalQuestions open-book F1
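
The ranking below is ordered by mean win rate. As a rough illustration of how such a figure can be derived, the sketch below assumes the usual reading of the metric: for each scenario, the fraction of other models a given model beats, averaged over scenarios. The function name, the tie handling (a tie counts as no win), and the sample scores are all hypothetical; none of them come from this page or from the HELM codebase.

```python
from typing import Dict


def mean_win_rate(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Compute a mean win rate per model.

    scores[scenario][model] is a score where higher is better.
    Assumption: ties count as no win, and a model is only averaged
    over the scenarios in which it was actually evaluated.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for per_model in scores.values():
        for model, score in per_model.items():
            others = [v for name, v in per_model.items() if name != model]
            if not others:
                continue  # nothing to compare against in this scenario
            wins = sum(1 for v in others if score > v)
            totals[model] = totals.get(model, 0.0) + wins / len(others)
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Hypothetical scores for three placeholder models on two scenarios.
example = {
    "scenario_1": {"model_a": 0.9, "model_b": 0.5, "model_c": 0.1},
    "scenario_2": {"model_a": 0.4, "model_b": 0.6, "model_c": 0.2},
}

for model, rate in sorted(mean_win_rate(example).items()):
    print(f"{model}: {rate:.2%}")
```

Because the value depends on which other models are in the comparison pool, a mean win rate is only meaningful relative to the set of models listed in the same table.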

Rank	Subject	Mean win rate	Model Match	Provenance	Sampled
1	Llama 2 (70B)	94.351981%	—	Imported	2026-05-27
2	LLaMA (65B)	90.825175%	—	Imported	2026-05-27
3	text-davinci-002	90.502788%	—	Imported	2026-05-27
4	Mistral v0.1 (7B)	88.403263%	—	Imported	2026-05-27
5	Cohere Command beta (52.4B)	87.449068%	—	Imported	2026-05-27
6	text-davinci-003	87.159957%	—	Imported	2026-05-27
7	Jurassic-2 Jumbo (178B)	82.438267%	—	Imported	2026-05-27
8	Llama 2 (13B)	82.300699%	—	Imported	2026-05-27
9	TNLG v2 (530B)	78.65258%	—	Imported	2026-05-27
10	gpt-3.5-turbo-0613	78.296037%	GPT-3.5 Turbo openai-gpt-3.5-turbo	Imported	2026-05-27
11	LLaMA (30B)	78.128205%	—	Imported	2026-05-27
12	Anthropic-LM v4-s3 (52B)	78.037742%	—	Imported	2026-05-27
13	gpt-3.5-turbo-0301	76.025641%	GPT-3.5 Turbo openai-gpt-3.5-turbo	Imported	2026-05-27
14	Jurassic-2 Grande (17B)	74.324689%	—	Imported	2026-05-27
15	Palmyra X (43B)	73.246451%	—	Imported	2026-05-27
16	Falcon (40B)	72.939394%	—	Imported	2026-05-27
17	Falcon-Instruct (40B)	72.655012%	—	Imported	2026-05-27
18	MPT-Instruct (30B)	71.638695%	—	Imported	2026-05-27
19	MPT (30B)	71.449883%	—	Imported	2026-05-27
20	J1-Grande v2 beta (17B)	70.639592%	—	Imported	2026-05-27
21	Vicuna v1.3 (13B)	70.631702%	—	Imported	2026-05-27
22	Cohere Command beta (6.1B)	67.521002%	—	Imported	2026-05-27
23	Cohere xlarge v20221108 (52.4B)	66.395092%	—	Imported	2026-05-27
24	Luminous Supreme (70B)	66.159188%	—	Imported	2026-05-27
25	Vicuna v1.3 (7B)	62.526807%	—	Imported	2026-05-27
26	OPT (175B)	60.945576%	—	Imported	2026-05-27
27	Llama 2 (7B)	60.731935%	—	Imported	2026-05-27
28	LLaMA (13B)	59.468531%	—	Imported	2026-05-27
29	InstructPalmyra (30B)	56.845377%	—	Imported	2026-05-27
30	Cohere xlarge v20220609 (52.4B)	55.954648%	—	Imported	2026-05-27
31	Jurassic-2 Large (7.5B)	55.296191%	—	Imported	2026-05-27
32	davinci (175B)	53.770035%	—	Imported	2026-05-27
33	LLaMA (7B)	53.268065%	—	Imported	2026-05-27
34	RedPajama-INCITE-Instruct (7B)	52.424242%	—	Imported	2026-05-27
35	J1-Jumbo v1 (178B)	51.650298%	—	Imported	2026-05-27
36	GLM (130B)	51.212121%	—	Imported	2026-05-27
37	Luminous Extended (30B)	48.50136%	—	Imported	2026-05-27
38	OPT (66B)	44.801021%	—	Imported	2026-05-27
39	BLOOM (176B)	44.607322%	—	Imported	2026-05-27
40	J1-Grande v1 (17B)	43.263522%	—	Imported	2026-05-27
41	Alpaca (7B)	38.088578%	—	Imported	2026-05-27
42	Falcon (7B)	37.834499%	—	Imported	2026-05-27
43	RedPajama-INCITE-Base (7B)	37.806527%	—	Imported	2026-05-27
44	Cohere large v20220720 (13.1B)	37.183516%	—	Imported	2026-05-27
45	RedPajama-INCITE-Instruct-v1 (3B)	36.608392%	—	Imported	2026-05-27
46	text-curie-001	35.974581%	—	Imported	2026-05-27
47	GPT-NeoX (20B)	35.097193%	—	Imported	2026-05-27
48	Luminous Base (13B)	31.543318%	—	Imported	2026-05-27
49	Cohere medium v20221108 (6.1B)	31.207175%	—	Imported	2026-05-27
50	RedPajama-INCITE-Base-v1 (3B)	31.081585%	—	Imported	2026-05-27
51	TNLG v2 (6.7B)	30.921482%	—	Imported	2026-05-27
52	J1-Large v1 (7.5B)	28.522344%	—	Imported	2026-05-27
53	GPT-J (6B)	27.275385%	—	Imported	2026-05-27
54	Pythia (12B)	25.678322%	—	Imported	2026-05-27
55	curie (6.7B)	24.737934%	—	Imported	2026-05-27
56	Falcon-Instruct (7B)	24.405594%	—	Imported	2026-05-27
57	Cohere medium v20220720 (6.1B)	22.967173%	—	Imported	2026-05-27
58	text-babbage-001	22.864976%	—	Imported	2026-05-27
59	T0pp (11B)	19.708625%	—	Imported	2026-05-27
60	Pythia (6.9B)	19.559441%	—	Imported	2026-05-27
61	UL2 (20B)	16.721542%	—	Imported	2026-05-27
62	T5 (11B)	13.136169%	—	Imported	2026-05-27
63	babbage (1.3B)	11.400428%	—	Imported	2026-05-27
64	Cohere small v20220720 (410M)	10.872046%	—	Imported	2026-05-27
65	ada (350M)	10.83284%	—	Imported	2026-05-27
66	text-ada-001	10.733701%	—	Imported	2026-05-27
67	YaLM (100B)	7.453866%	—	Imported	2026-05-27
