MathTutorBench

Open-ended math tutoring benchmark for measuring pedagogical capabilities such as mistake diagnosis, Socratic questioning, and scaffolding.

Rows: 8
Primary metric: overall_average
Sampled: 2026-05-27

Metadata

Metrics

Overall Average, Problem Solving, Socratic Questioning, Solution Correctness, Mistake Location, Mistake Correction, Scaffolding Win Rate, Pedagogy IF Win Rate, Scaffolding (Hard), Pedagogy IF (Hard)

Latest Results

Rows parsed from the public MathTutorBench HTML table. Overall average is a BenchmarkList-derived mean of the nine published capability metrics.
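As a rough illustration of the derived overall average described above, the following sketch takes an unweighted mean of nine capability scores and rounds it to four decimals, matching the precision shown in the table. The scores, the dictionary keys, and the `overall_average` helper are hypothetical placeholders, not values or names from the leaderboard, and the unweighted mean and the rounding are assumptions.

```python
from statistics import mean

# Hypothetical per-metric scores for a single model.
# These are placeholder numbers, NOT values from the MathTutorBench table.
scores = {
    "problem_solving": 0.80,
    "socratic_questioning": 0.60,
    "solution_correctness": 0.50,
    "mistake_location": 0.40,
    "mistake_correction": 0.30,
    "scaffolding_win_rate": 0.70,
    "pedagogy_if_win_rate": 0.90,
    "scaffolding_hard": 0.60,
    "pedagogy_if_hard": 0.60,
}


def overall_average(metric_scores):
    """Unweighted mean of the capability metrics, rounded to four decimals.

    Assumption: every metric contributes equally and the result is
    rounded to the four-decimal precision used in the results table.
    """
    return round(mean(list(metric_scores.values())), 4)


print(overall_average(scores))
```

Keeping the per-metric scores in a dictionary keyed by metric name makes it easy to check that exactly nine capability metrics contribute to the mean.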

Rank  Subject                   Overall Average  Model Match                                                 Provenance  Sampled
1     LearnLM-1.5-Pro           0.6633           -                                                           Imported    2026-05-27
2     GPT-4o                    0.6378           GPT-4o (openai-gpt-4o)                                      Imported    2026-05-27
3     LLaMA3.1-70B-Instruct     0.5522           Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct)  Imported    2026-05-27
4     LLaMA3.1-8B-Instruct      0.4700           Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct)    Imported    2026-05-27
5     LLaMA3.2-3B-Instruct      0.4689           Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct)    Imported    2026-05-27
6     Llemma-7B-ScienceTutor    0.4078           -                                                           Imported    2026-05-27
7     Qwen2.5-7B-SocraticLM     0.3400           -                                                           Imported    2026-05-27
8     Qwen2.5-Math-7B-Instruct  0.3167           -                                                           Imported    2026-05-27
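
The ranks listed here are consistent with sorting the rows by the primary metric, overall_average, in descending order. A minimal sketch of that ordering follows, using the subject names and scores from the results above. The sort rule itself is an assumption inferred from the data, and no tie-breaking is handled because none of these scores tie.

```python
# (subject, overall_average) pairs copied from the results,
# deliberately listed out of order to show the sort.
rows = [
    ("GPT-4o", 0.6378),
    ("Qwen2.5-Math-7B-Instruct", 0.3167),
    ("LLaMA3.1-8B-Instruct", 0.4700),
    ("LearnLM-1.5-Pro", 0.6633),
    ("Llemma-7B-ScienceTutor", 0.4078),
    ("LLaMA3.2-3B-Instruct", 0.4689),
    ("Qwen2.5-7B-SocraticLM", 0.3400),
    ("LLaMA3.1-70B-Instruct", 0.5522),
]

# Assumed ranking rule: higher overall_average ranks first.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

for rank, (subject, score) in enumerate(ranked, start=1):
    print(f"{rank}  {subject:<26}{score:.4f}")
```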