MATH (CoT)

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem comes with a full step-by-step solution, is assigned a difficulty level from 1 to 5, and belongs to one of seven mathematical subjects. This variant uses Chain-of-Thought (CoT) prompting to encourage step-by-step reasoning.
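To make the Chain-of-Thought setup concrete, here is a minimal Python sketch of how a CoT prompt might be built and how a final answer could be pulled out of a model response. It relies on the MATH convention that reference solutions mark the final answer with `\boxed{...}`. The prompt wording and the helper names (`build_prompt`, `extract_boxed`, `is_correct`) are illustrative assumptions, not the harness behind the scores on this page.

```python
from typing import Optional

# Hypothetical prompt template; the wording used by any real harness may differ.
COT_TEMPLATE = (
    "Solve the following problem. Think step by step, then give the final "
    "answer in the form \\boxed{{answer}}.\n\nProblem: {problem}\n\nSolution:"
)


def build_prompt(problem: str) -> str:
    """Wrap a problem statement in the Chain-of-Thought template."""
    return COT_TEMPLATE.format(problem=problem)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are counted so nested LaTeX such as \\frac{1}{2} is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def is_correct(model_output: str, reference_solution: str) -> bool:
    """Exact string match on the boxed answers (an assumed scoring rule)."""
    predicted = extract_boxed(model_output)
    gold = extract_boxed(reference_solution)
    return predicted is not None and predicted == gold
```

Exact string matching is the simplest possible rule. In practice, evaluations usually normalize answers first (whitespace, equivalent fraction forms, units), so scores from this sketch would not necessarily match reported ones.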

Rows: 6
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | Llama 3.1 70B Instruct | 0.68 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Self-reported | 2026-05-06 |
| 2 | Ministral 3 (14B Base 2512) | 0.68 | | Self-reported | 2026-05-06 |
| 2 | Mistral Large 3 | 0.68 | | Self-reported | 2026-05-06 |
| 4 | Ministral 3 (8B Base 2512) | 0.63 | | Self-reported | 2026-05-06 |
| 5 | Ministral 3 (3B Base 2512) | 0.60 | | Self-reported | 2026-05-06 |
| 6 | Llama 3.1 8B Instruct | 0.52 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Self-reported | 2026-05-06 |