MATH

MATH: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.

69 rows
Primary metric: math_equivalent
Sampled: 2026-05-27

Metadata

Metrics

MATH Equivalent, MATH Equivalent (chain of thought)

Latest Results

Rows are ranked by MATH Equivalent from the HELM Classic targeted evaluations aggregate table.

Rank Subject MATH Equivalent Model Match Provenance Sampled
1 gpt-3.5-turbo-0301 48.83286% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
2 gpt-3.5-turbo-0613 45.27817% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
3 code-davinci-002 41.017627% Imported 2026-05-27
4 text-davinci-003 39.009888% Imported 2026-05-27
5 text-davinci-002 32.790956% Imported 2026-05-27
6 Palmyra X (43B) 30.098533% Imported 2026-05-27
7 Llama 2 (70B) 26.084225% Imported 2026-05-27
8 LLaMA (65B) 22.396174% Imported 2026-05-27
9 Falcon (40B) 20.97848% Imported 2026-05-27
10 Mistral v0.1 (7B) 20.873365% Imported 2026-05-27
11 Anthropic-LM v4-s3 (52B) 19.793414% Imported 2026-05-27
12 LLaMA (30B) 19.65865% Imported 2026-05-27
13 Jurassic-2 Jumbo (178B) 19.553501% Imported 2026-05-27
14 Falcon-Instruct (40B) 18.148671% Imported 2026-05-27
15 MPT (30B) 17.806599% Imported 2026-05-27
16 TNLG v2 (530B) 15.489357% Imported 2026-05-27
17 Luminous Supreme (70B) 14.919933% Imported 2026-05-27
18 Jurassic-2 Grande (17B) 14.640906% Imported 2026-05-27
19 Llama 2 (13B) 14.459885% Imported 2026-05-27
20 GPT-NeoX (20B) 14.052507% Imported 2026-05-27
21 Cohere xlarge v20220609 (52.4B) 13.536209% Imported 2026-05-27
22 LLaMA (13B) 13.362476% Imported 2026-05-27
23 Cohere Command beta (52.4B) 13.256253% Imported 2026-05-27
24 Cohere xlarge v20221108 (52.4B) 13.177521% Imported 2026-05-27
25 J1-Grande v2 beta (17B) 12.740099% Imported 2026-05-27
26 MPT-Instruct (30B) 12.732926% Imported 2026-05-27
27 Vicuna v1.3 (13B) 12.035478% Imported 2026-05-27
28 LLaMA (7B) 11.187602% Imported 2026-05-27
29 Luminous Extended (30B) 11.109259% Imported 2026-05-27
30 GPT-J (6B) 11.076901% Imported 2026-05-27
31 Falcon (7B) 10.836847% Imported 2026-05-27
32 Llama 2 (7B) 10.733796% Imported 2026-05-27
33 Alpaca (7B) 10.429425% Imported 2026-05-27
34 Pythia (12B) 10.045208% Imported 2026-05-27
35 RedPajama-INCITE-Base (7B) 9.979812% Imported 2026-05-27
36 code-cushman-001 (12B) 9.892917% Imported 2026-05-27
37 davinci (175B) 9.885118% Imported 2026-05-27
38 InstructPalmyra (30B) 9.863626% Imported 2026-05-27
39 Pythia (6.9B) 9.102275% Imported 2026-05-27
40 J1-Jumbo v1 (178B) 8.857565% Imported 2026-05-27
41 Luminous Base (13B) 8.853546% Imported 2026-05-27
42 Vicuna v1.3 (7B) 8.77789% Imported 2026-05-27
43 J1-Grande v1 (17B) 7.963258% Imported 2026-05-27
44 Cohere Command beta (6.1B) 7.586931% Imported 2026-05-27
45 Cohere large v20220720 (13.1B) 7.341662% Imported 2026-05-27
46 Jurassic-2 Large (7.5B) 7.031941% Imported 2026-05-27
47 Falcon-Instruct (7B) 6.868615% Imported 2026-05-27
48 TNLG v2 (6.7B) 6.792603% Imported 2026-05-27
49 OPT (175B) 6.504253% Imported 2026-05-27
50 RedPajama-INCITE-Instruct-v1 (3B) 5.997848% Imported 2026-05-27
51 RedPajama-INCITE-Base-v1 (3B) 5.857448% Imported 2026-05-27
52 RedPajama-INCITE-Instruct (7B) 5.845223% Imported 2026-05-27
53 Cohere medium v20221108 (6.1B) 5.178122% Imported 2026-05-27
54 curie (6.7B) 4.964939% Imported 2026-05-27
55 J1-Large v1 (7.5B) 4.897953% Imported 2026-05-27
56 Cohere medium v20220720 (6.1B) 4.89085% Imported 2026-05-27
57 babbage (1.3B) 4.831366% Imported 2026-05-27
58 OPT (66B) 4.831072% Imported 2026-05-27
59 ada (350M) 4.638923% Imported 2026-05-27
60 text-curie-001 4.532593% Imported 2026-05-27
61 BLOOM (176B) 4.346832% Imported 2026-05-27
62 text-ada-001 2.043956% Imported 2026-05-27
63 text-babbage-001 1.581603% Imported 2026-05-27
64 Cohere small v20220720 (410M) 1.570685% Imported 2026-05-27
65 GLM (130B) 0% Imported 2026-05-27
66 T0pp (11B) 0% Imported 2026-05-27
67 T5 (11B) 0% Imported 2026-05-27
68 UL2 (20B) 0% Imported 2026-05-27
69 YaLM (100B) 0% Imported 2026-05-27
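
The ranking rule stated above the table (rows ordered by MATH Equivalent, highest first) can be reproduced with a short script. The following is a minimal sketch, not the code used to build this page: the names `SAMPLE_ROWS`, `ROW_PATTERN`, `parse_rows`, and `rank_by_score` are hypothetical, and the row layout (rank, subject, percentage, then trailing columns) is assumed from the flat text shown here. How the original table orders tied scores, such as the 0% rows, is not stated; this sketch keeps tied rows in input order.

```python
import re

# Three rows copied from the table above, in the page's flat text layout.
SAMPLE_ROWS = """\
3 code-davinci-002 41.017627% Imported 2026-05-27
7 Llama 2 (70B) 26.084225% Imported 2026-05-27
65 GLM (130B) 0% Imported 2026-05-27
"""

# Assumed layout: rank, subject, percentage score. Anything after the
# score (model match, provenance, sampled date) is ignored in this sketch.
ROW_PATTERN = re.compile(r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)%")


def parse_rows(text):
    """Return (rank, subject, score) tuples for every line that matches."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rank, subject, score = match.groups()
            rows.append((int(rank), subject, float(score)))
    return rows


def rank_by_score(rows):
    """Re-rank rows by score, highest first; ties keep their input order."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return [
        (position, subject, score)
        for position, (_, subject, score) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for position, subject, score in rank_by_score(parse_rows(SAMPLE_ROWS)):
        print(position, subject, f"{score}%")
```

The lazy `(.+?)` group lets subject names contain spaces and parenthesized sizes, because the match only ends at a number immediately followed by `%`.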