MATH
MATH: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
Rows: 69
Primary metric: math_equivalent
Sampled: 2026-05-27
Metadata
Metrics: MATH Equivalent, MATH Equivalent (chain of thought)
| Rank | Subject | MATH Equivalent | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-3.5-turbo-0301 | 48.83286% | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 2 | gpt-3.5-turbo-0613 | 45.27817% | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 3 | code-davinci-002 | 41.017627% | — | Imported | 2026-05-27 |
| 4 | text-davinci-003 | 39.009888% | — | Imported | 2026-05-27 |
| 5 | text-davinci-002 | 32.790956% | — | Imported | 2026-05-27 |
| 6 | Palmyra X (43B) | 30.098533% | — | Imported | 2026-05-27 |
| 7 | Llama 2 (70B) | 26.084225% | — | Imported | 2026-05-27 |
| 8 | LLaMA (65B) | 22.396174% | — | Imported | 2026-05-27 |
| 9 | Falcon (40B) | 20.97848% | — | Imported | 2026-05-27 |
| 10 | Mistral v0.1 (7B) | 20.873365% | — | Imported | 2026-05-27 |
| 11 | Anthropic-LM v4-s3 (52B) | 19.793414% | — | Imported | 2026-05-27 |
| 12 | LLaMA (30B) | 19.65865% | — | Imported | 2026-05-27 |
| 13 | Jurassic-2 Jumbo (178B) | 19.553501% | — | Imported | 2026-05-27 |
| 14 | Falcon-Instruct (40B) | 18.148671% | — | Imported | 2026-05-27 |
| 15 | MPT (30B) | 17.806599% | — | Imported | 2026-05-27 |
| 16 | TNLG v2 (530B) | 15.489357% | — | Imported | 2026-05-27 |
| 17 | Luminous Supreme (70B) | 14.919933% | — | Imported | 2026-05-27 |
| 18 | Jurassic-2 Grande (17B) | 14.640906% | — | Imported | 2026-05-27 |
| 19 | Llama 2 (13B) | 14.459885% | — | Imported | 2026-05-27 |
| 20 | GPT-NeoX (20B) | 14.052507% | — | Imported | 2026-05-27 |
| 21 | Cohere xlarge v20220609 (52.4B) | 13.536209% | — | Imported | 2026-05-27 |
| 22 | LLaMA (13B) | 13.362476% | — | Imported | 2026-05-27 |
| 23 | Cohere Command beta (52.4B) | 13.256253% | — | Imported | 2026-05-27 |
| 24 | Cohere xlarge v20221108 (52.4B) | 13.177521% | — | Imported | 2026-05-27 |
| 25 | J1-Grande v2 beta (17B) | 12.740099% | — | Imported | 2026-05-27 |
| 26 | MPT-Instruct (30B) | 12.732926% | — | Imported | 2026-05-27 |
| 27 | Vicuna v1.3 (13B) | 12.035478% | — | Imported | 2026-05-27 |
| 28 | LLaMA (7B) | 11.187602% | — | Imported | 2026-05-27 |
| 29 | Luminous Extended (30B) | 11.109259% | — | Imported | 2026-05-27 |
| 30 | GPT-J (6B) | 11.076901% | — | Imported | 2026-05-27 |
| 31 | Falcon (7B) | 10.836847% | — | Imported | 2026-05-27 |
| 32 | Llama 2 (7B) | 10.733796% | — | Imported | 2026-05-27 |
| 33 | Alpaca (7B) | 10.429425% | — | Imported | 2026-05-27 |
| 34 | Pythia (12B) | 10.045208% | — | Imported | 2026-05-27 |
| 35 | RedPajama-INCITE-Base (7B) | 9.979812% | — | Imported | 2026-05-27 |
| 36 | code-cushman-001 (12B) | 9.892917% | — | Imported | 2026-05-27 |
| 37 | davinci (175B) | 9.885118% | — | Imported | 2026-05-27 |
| 38 | InstructPalmyra (30B) | 9.863626% | — | Imported | 2026-05-27 |
| 39 | Pythia (6.9B) | 9.102275% | — | Imported | 2026-05-27 |
| 40 | J1-Jumbo v1 (178B) | 8.857565% | — | Imported | 2026-05-27 |
| 41 | Luminous Base (13B) | 8.853546% | — | Imported | 2026-05-27 |
| 42 | Vicuna v1.3 (7B) | 8.77789% | — | Imported | 2026-05-27 |
| 43 | J1-Grande v1 (17B) | 7.963258% | — | Imported | 2026-05-27 |
| 44 | Cohere Command beta (6.1B) | 7.586931% | — | Imported | 2026-05-27 |
| 45 | Cohere large v20220720 (13.1B) | 7.341662% | — | Imported | 2026-05-27 |
| 46 | Jurassic-2 Large (7.5B) | 7.031941% | — | Imported | 2026-05-27 |
| 47 | Falcon-Instruct (7B) | 6.868615% | — | Imported | 2026-05-27 |
| 48 | TNLG v2 (6.7B) | 6.792603% | — | Imported | 2026-05-27 |
| 49 | OPT (175B) | 6.504253% | — | Imported | 2026-05-27 |
| 50 | RedPajama-INCITE-Instruct-v1 (3B) | 5.997848% | — | Imported | 2026-05-27 |
| 51 | RedPajama-INCITE-Base-v1 (3B) | 5.857448% | — | Imported | 2026-05-27 |
| 52 | RedPajama-INCITE-Instruct (7B) | 5.845223% | — | Imported | 2026-05-27 |
| 53 | Cohere medium v20221108 (6.1B) | 5.178122% | — | Imported | 2026-05-27 |
| 54 | curie (6.7B) | 4.964939% | — | Imported | 2026-05-27 |
| 55 | J1-Large v1 (7.5B) | 4.897953% | — | Imported | 2026-05-27 |
| 56 | Cohere medium v20220720 (6.1B) | 4.89085% | — | Imported | 2026-05-27 |
| 57 | babbage (1.3B) | 4.831366% | — | Imported | 2026-05-27 |
| 58 | OPT (66B) | 4.831072% | — | Imported | 2026-05-27 |
| 59 | ada (350M) | 4.638923% | — | Imported | 2026-05-27 |
| 60 | text-curie-001 | 4.532593% | — | Imported | 2026-05-27 |
| 61 | BLOOM (176B) | 4.346832% | — | Imported | 2026-05-27 |
| 62 | text-ada-001 | 2.043956% | — | Imported | 2026-05-27 |
| 63 | text-babbage-001 | 1.581603% | — | Imported | 2026-05-27 |
| 64 | Cohere small v20220720 (410M) | 1.570685% | — | Imported | 2026-05-27 |
| 65 | GLM (130B) | 0% | — | Imported | 2026-05-27 |
| 66 | T0pp (11B) | 0% | — | Imported | 2026-05-27 |
| 67 | T5 (11B) | 0% | — | Imported | 2026-05-27 |
| 68 | UL2 (20B) | 0% | — | Imported | 2026-05-27 |
| 69 | YaLM (100B) | 0% | — | Imported | 2026-05-27 |
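
For readers who want to work with these rows programmatically, the following is a minimal sketch of parsing the pipe-delimited table into tuples. The function name `parse_rows` and the `sample` variable are hypothetical and not part of any leaderboard tooling; the sketch assumes the first three columns are always rank, subject, and a percentage score, as in the table above.

```python
def parse_rows(table_text):
    """Parse pipe-delimited leaderboard rows into (rank, subject, score) tuples.

    Hypothetical helper: assumes column order Rank | Subject | score%.
    """
    rows = []
    for line in table_text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Header and separator lines are skipped: only data rows
        # begin with an integer rank.
        if len(cells) < 3 or not cells[0].isdigit():
            continue
        rows.append((int(cells[0]), cells[1], float(cells[2].rstrip("%"))))
    return rows


# Three rows copied from the table above, used as sample input.
sample = """
| Rank | Subject | MATH Equivalent | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 3 | code-davinci-002 | 41.017627% | — | Imported | 2026-05-27 |
| 7 | Llama 2 (70B) | 26.084225% | — | Imported | 2026-05-27 |
| 65 | GLM (130B) | 0% | — | Imported | 2026-05-27 |
"""

rows = parse_rows(sample)
# Subjects with a score of exactly 0%, which share the bottom of the ranking.
zero_scored = [subject for _, subject, score in rows if score == 0.0]
```

Filtering on a score of exactly 0% is one way to pick out the tied entries at ranks 65 to 69, whose relative order the score alone does not determine.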