HELM
HELM evaluates language models on broad knowledge, reasoning, commonsense, instruction following, and exam-style accuracy.
Rows: 67
Primary metric: mean_win_rate
Sampled: 2026-05-27
Metadata
Metrics: Mean win rate, MMLU Exact Match, BoolQ Exact Match, NarrativeQA F1, NaturalQuestions closed-book F1, NaturalQuestions open-book F1
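The primary metric, mean win rate, is commonly read as the share of head-to-head comparisons a model wins against the other models, averaged over the individual metrics. The sketch below illustrates that reading on made-up numbers. It assumes every metric is higher-is-better and counts only strict wins; the function name `mean_win_rate`, the toy model names, and the tie handling are illustrative assumptions, not HELM's actual implementation or data.

```python
from typing import Dict, List


def mean_win_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average, over metrics, of the fraction of other models strictly beaten.

    `scores` maps a model name to its per-metric scores (same metric order
    for every model). Assumes higher is better; ties count as no win.
    """
    models = list(scores)
    n_metrics = len(next(iter(scores.values())))
    result: Dict[str, float] = {}
    for model in models:
        others = [o for o in models if o != model]
        per_metric = []
        for j in range(n_metrics):
            # Count how many other models this model strictly beats on metric j.
            wins = sum(scores[model][j] > scores[o][j] for o in others)
            per_metric.append(wins / len(others))
        result[model] = sum(per_metric) / n_metrics
    return result


# Toy, made-up scores (NOT taken from the leaderboard): three hypothetical
# models scored on two hypothetical metrics.
toy = {
    "model_a": [0.90, 0.80],
    "model_b": [0.70, 0.85],
    "model_c": [0.50, 0.60],
}
rates = mean_win_rate(toy)
for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")
```

Because the value is relative to the set of models being compared, adding or removing rows changes every model's rate, so percentages from different snapshots are not directly comparable.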
| Rank | Subject | Mean win rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Llama 2 (70B) | 94.351981% | — | Imported | 2026-05-27 |
| 2 | LLaMA (65B) | 90.825175% | — | Imported | 2026-05-27 |
| 3 | text-davinci-002 | 90.502788% | — | Imported | 2026-05-27 |
| 4 | Mistral v0.1 (7B) | 88.403263% | — | Imported | 2026-05-27 |
| 5 | Cohere Command beta (52.4B) | 87.449068% | — | Imported | 2026-05-27 |
| 6 | text-davinci-003 | 87.159957% | — | Imported | 2026-05-27 |
| 7 | Jurassic-2 Jumbo (178B) | 82.438267% | — | Imported | 2026-05-27 |
| 8 | Llama 2 (13B) | 82.300699% | — | Imported | 2026-05-27 |
| 9 | TNLG v2 (530B) | 78.65258% | — | Imported | 2026-05-27 |
| 10 | gpt-3.5-turbo-0613 | 78.296037% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 11 | LLaMA (30B) | 78.128205% | — | Imported | 2026-05-27 |
| 12 | Anthropic-LM v4-s3 (52B) | 78.037742% | — | Imported | 2026-05-27 |
| 13 | gpt-3.5-turbo-0301 | 76.025641% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 14 | Jurassic-2 Grande (17B) | 74.324689% | — | Imported | 2026-05-27 |
| 15 | Palmyra X (43B) | 73.246451% | — | Imported | 2026-05-27 |
| 16 | Falcon (40B) | 72.939394% | — | Imported | 2026-05-27 |
| 17 | Falcon-Instruct (40B) | 72.655012% | — | Imported | 2026-05-27 |
| 18 | MPT-Instruct (30B) | 71.638695% | — | Imported | 2026-05-27 |
| 19 | MPT (30B) | 71.449883% | — | Imported | 2026-05-27 |
| 20 | J1-Grande v2 beta (17B) | 70.639592% | — | Imported | 2026-05-27 |
| 21 | Vicuna v1.3 (13B) | 70.631702% | — | Imported | 2026-05-27 |
| 22 | Cohere Command beta (6.1B) | 67.521002% | — | Imported | 2026-05-27 |
| 23 | Cohere xlarge v20221108 (52.4B) | 66.395092% | — | Imported | 2026-05-27 |
| 24 | Luminous Supreme (70B) | 66.159188% | — | Imported | 2026-05-27 |
| 25 | Vicuna v1.3 (7B) | 62.526807% | — | Imported | 2026-05-27 |
| 26 | OPT (175B) | 60.945576% | — | Imported | 2026-05-27 |
| 27 | Llama 2 (7B) | 60.731935% | — | Imported | 2026-05-27 |
| 28 | LLaMA (13B) | 59.468531% | — | Imported | 2026-05-27 |
| 29 | InstructPalmyra (30B) | 56.845377% | — | Imported | 2026-05-27 |
| 30 | Cohere xlarge v20220609 (52.4B) | 55.954648% | — | Imported | 2026-05-27 |
| 31 | Jurassic-2 Large (7.5B) | 55.296191% | — | Imported | 2026-05-27 |
| 32 | davinci (175B) | 53.770035% | — | Imported | 2026-05-27 |
| 33 | LLaMA (7B) | 53.268065% | — | Imported | 2026-05-27 |
| 34 | RedPajama-INCITE-Instruct (7B) | 52.424242% | — | Imported | 2026-05-27 |
| 35 | J1-Jumbo v1 (178B) | 51.650298% | — | Imported | 2026-05-27 |
| 36 | GLM (130B) | 51.212121% | — | Imported | 2026-05-27 |
| 37 | Luminous Extended (30B) | 48.50136% | — | Imported | 2026-05-27 |
| 38 | OPT (66B) | 44.801021% | — | Imported | 2026-05-27 |
| 39 | BLOOM (176B) | 44.607322% | — | Imported | 2026-05-27 |
| 40 | J1-Grande v1 (17B) | 43.263522% | — | Imported | 2026-05-27 |
| 41 | Alpaca (7B) | 38.088578% | — | Imported | 2026-05-27 |
| 42 | Falcon (7B) | 37.834499% | — | Imported | 2026-05-27 |
| 43 | RedPajama-INCITE-Base (7B) | 37.806527% | — | Imported | 2026-05-27 |
| 44 | Cohere large v20220720 (13.1B) | 37.183516% | — | Imported | 2026-05-27 |
| 45 | RedPajama-INCITE-Instruct-v1 (3B) | 36.608392% | — | Imported | 2026-05-27 |
| 46 | text-curie-001 | 35.974581% | — | Imported | 2026-05-27 |
| 47 | GPT-NeoX (20B) | 35.097193% | — | Imported | 2026-05-27 |
| 48 | Luminous Base (13B) | 31.543318% | — | Imported | 2026-05-27 |
| 49 | Cohere medium v20221108 (6.1B) | 31.207175% | — | Imported | 2026-05-27 |
| 50 | RedPajama-INCITE-Base-v1 (3B) | 31.081585% | — | Imported | 2026-05-27 |
| 51 | TNLG v2 (6.7B) | 30.921482% | — | Imported | 2026-05-27 |
| 52 | J1-Large v1 (7.5B) | 28.522344% | — | Imported | 2026-05-27 |
| 53 | GPT-J (6B) | 27.275385% | — | Imported | 2026-05-27 |
| 54 | Pythia (12B) | 25.678322% | — | Imported | 2026-05-27 |
| 55 | curie (6.7B) | 24.737934% | — | Imported | 2026-05-27 |
| 56 | Falcon-Instruct (7B) | 24.405594% | — | Imported | 2026-05-27 |
| 57 | Cohere medium v20220720 (6.1B) | 22.967173% | — | Imported | 2026-05-27 |
| 58 | text-babbage-001 | 22.864976% | — | Imported | 2026-05-27 |
| 59 | T0pp (11B) | 19.708625% | — | Imported | 2026-05-27 |
| 60 | Pythia (6.9B) | 19.559441% | — | Imported | 2026-05-27 |
| 61 | UL2 (20B) | 16.721542% | — | Imported | 2026-05-27 |
| 62 | T5 (11B) | 13.136169% | — | Imported | 2026-05-27 |
| 63 | babbage (1.3B) | 11.400428% | — | Imported | 2026-05-27 |
| 64 | Cohere small v20220720 (410M) | 10.872046% | — | Imported | 2026-05-27 |
| 65 | ada (350M) | 10.83284% | — | Imported | 2026-05-27 |
| 66 | text-ada-001 | 10.733701% | — | Imported | 2026-05-27 |
| 67 | YaLM (100B) | 7.453866% | — | Imported | 2026-05-27 |