HELM LegalBench
Measures legal reasoning, contract review, statute interpretation, and legal-domain QA.
Rows: 69
Primary metric: exact_match
Sampled: 2026-05-27
Metadata
Metrics
Exact match, Denoised inference time (lower is better), # eval, # train, # prompt tokens (lower is better), # output tokens (lower is better), # trials
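As a rough illustration of the primary metric, the sketch below computes an exact-match percentage over a list of predictions. It assumes plain string equality after stripping surrounding whitespace; the function name `exact_match_rate` and its inputs are hypothetical, and the scoring code behind this leaderboard may normalize text differently.

```python
def exact_match_rate(predictions, references):
    """Return the share of predictions equal to their reference, in percent.

    Assumption: a prediction counts as a match only if it equals the
    reference after stripping leading and trailing whitespace. Case and
    punctuation are not normalized.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Hypothetical model outputs and gold labels, for illustration only.
predictions = ["Yes", "No ", "yes", "No"]
references = ["Yes", "No", "Yes", "Yes"]
print(exact_match_rate(predictions, references))  # 2 of 4 match -> 50.0
```

Under this definition a score near 50% on a two-label task is about what guessing would produce, which is worth keeping in mind when reading the lower half of the table.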
| Rank | Subject | Exact match | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Jurassic-2 Jumbo (178B) | 63.871847% | — | Imported | 2026-05-27 |
| 2 | LLaMA (30B) | 63.803681% | — | Imported | 2026-05-27 |
| 3 | gpt-3.5-turbo-0301 | 62.781186% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 4 | Anthropic-LM v4-s3 (52B) | 62.440354% | — | Imported | 2026-05-27 |
| 5 | Palmyra X (43B) | 62.304022% | — | Imported | 2026-05-27 |
| 6 | text-davinci-003 | 62.167689% | — | Imported | 2026-05-27 |
| 7 | text-davinci-002 | 61.486026% | — | Imported | 2026-05-27 |
| 8 | T0pp (11B) | 61.145194% | — | Imported | 2026-05-27 |
| 9 | Cohere Command beta (52.4B) | 60.599864% | — | Imported | 2026-05-27 |
| 10 | Falcon (40B) | 60.531697% | — | Imported | 2026-05-27 |
| 11 | Falcon-Instruct (40B) | 60.531697% | — | Imported | 2026-05-27 |
| 12 | Vicuna v1.3 (7B) | 60.531697% | — | Imported | 2026-05-27 |
| 13 | LLaMA (65B) | 59.100204% | — | Imported | 2026-05-27 |
| 14 | Llama 2 (13B) | 59.100204% | — | Imported | 2026-05-27 |
| 15 | Vicuna v1.3 (13B) | 58.895706% | — | Imported | 2026-05-27 |
| 16 | LLaMA (13B) | 58.691207% | — | Imported | 2026-05-27 |
| 17 | Llama 2 (70B) | 58.486708% | — | Imported | 2026-05-27 |
| 18 | Mistral v0.1 (7B) | 58.486708% | — | Imported | 2026-05-27 |
| 19 | TNLG v2 (530B) | 58.009543% | — | Imported | 2026-05-27 |
| 20 | Jurassic-2 Grande (17B) | 57.464213% | — | Imported | 2026-05-27 |
| 21 | RedPajama-INCITE-Instruct (7B) | 56.850716% | — | Imported | 2026-05-27 |
| 22 | Cohere Command beta (6.1B) | 56.646217% | — | Imported | 2026-05-27 |
| 23 | MPT (30B) | 56.441718% | — | Imported | 2026-05-27 |
| 24 | J1-Grande v2 beta (17B) | 56.237219% | — | Imported | 2026-05-27 |
| 25 | Cohere xlarge v20220609 (52.4B) | 55.828221% | — | Imported | 2026-05-27 |
| 26 | Jurassic-2 Large (7.5B) | 55.828221% | — | Imported | 2026-05-27 |
| 27 | T5 (11B) | 55.828221% | — | Imported | 2026-05-27 |
| 28 | BLOOM (176B) | 54.260395% | — | Imported | 2026-05-27 |
| 29 | MPT-Instruct (30B) | 53.783231% | — | Imported | 2026-05-27 |
| 30 | Llama 2 (7B) | 53.169734% | — | Imported | 2026-05-27 |
| 31 | OPT (175B) | 53.169734% | — | Imported | 2026-05-27 |
| 32 | Luminous Supreme (70B) | 52.965235% | — | Imported | 2026-05-27 |
| 33 | OPT (66B) | 52.69257% | — | Imported | 2026-05-27 |
| 34 | Cohere xlarge v20221108 (52.4B) | 52.556237% | — | Imported | 2026-05-27 |
| 35 | Cohere small v20220720 (410M) | 52.351738% | — | Imported | 2026-05-27 |
| 36 | Pythia (6.9B) | 52.147239% | — | Imported | 2026-05-27 |
| 37 | RedPajama-INCITE-Base (7B) | 51.738241% | — | Imported | 2026-05-27 |
| 38 | text-babbage-001 | 51.738241% | — | Imported | 2026-05-27 |
| 39 | Luminous Extended (30B) | 51.670075% | — | Imported | 2026-05-27 |
| 40 | GPT-NeoX (20B) | 51.465576% | — | Imported | 2026-05-27 |
| 41 | text-ada-001 | 51.465576% | — | Imported | 2026-05-27 |
| 42 | J1-Large v1 (7.5B) | 51.39741% | — | Imported | 2026-05-27 |
| 43 | Luminous Base (13B) | 51.329243% | — | Imported | 2026-05-27 |
| 44 | RedPajama-INCITE-Base-v1 (3B) | 51.329243% | — | Imported | 2026-05-27 |
| 45 | Falcon (7B) | 51.124744% | — | Imported | 2026-05-27 |
| 46 | Cohere medium v20220720 (6.1B) | 50.715746% | — | Imported | 2026-05-27 |
| 47 | UL2 (20B) | 50.579414% | — | Imported | 2026-05-27 |
| 48 | J1-Grande v1 (17B) | 50.443081% | — | Imported | 2026-05-27 |
| 49 | TNLG v2 (6.7B) | 50.374915% | — | Imported | 2026-05-27 |
| 50 | davinci (175B) | 49.625085% | — | Imported | 2026-05-27 |
| 51 | babbage (1.3B) | 49.216087% | — | Imported | 2026-05-27 |
| 52 | InstructPalmyra (30B) | 49.216087% | — | Imported | 2026-05-27 |
| 53 | Cohere large v20220720 (13.1B) | 49.147921% | — | Imported | 2026-05-27 |
| 54 | Pythia (12B) | 49.079755% | — | Imported | 2026-05-27 |
| 55 | curie (6.7B) | 49.011588% | — | Imported | 2026-05-27 |
| 56 | Cohere medium v20221108 (6.1B) | 48.943422% | — | Imported | 2026-05-27 |
| 57 | LLaMA (7B) | 48.466258% | — | Imported | 2026-05-27 |
| 58 | RedPajama-INCITE-Instruct-v1 (3B) | 48.466258% | — | Imported | 2026-05-27 |
| 59 | J1-Jumbo v1 (178B) | 48.398091% | — | Imported | 2026-05-27 |
| 60 | YaLM (100B) | 48.398091% | — | Imported | 2026-05-27 |
| 61 | Alpaca (7B) | 48.261759% | — | Imported | 2026-05-27 |
| 62 | GPT-J (6B) | 47.852761% | — | Imported | 2026-05-27 |
| 63 | gpt-3.5-turbo-0613 | 46.830266% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 64 | Falcon-Instruct (7B) | 45.194274% | — | Imported | 2026-05-27 |
| 65 | GLM (130B) | 45.057941% | — | Imported | 2026-05-27 |
| 66 | text-curie-001 | 44.239945% | — | Imported | 2026-05-27 |
| 67 | ada (350M) | 37.150648% | — | Imported | 2026-05-27 |
| 68 | code-cushman-001 (12B) | 0.0% | — | Imported | 2026-05-27 |
| 69 | code-davinci-002 | 0.0% | — | Imported | 2026-05-27 |