MathTutorBench
Open-ended math tutoring benchmark for measuring pedagogical capabilities such as mistake diagnosis, Socratic questioning, and scaffolding.
8 rows
Primary metric: overall_average
Sampled: 2026-05-27
Metadata
Metrics
Overall Average, Problem Solving, Socratic Questioning, Solution Correctness, Mistake Location, Mistake Correction, Scaffolding Win Rate, Pedagogy IF Win Rate, Scaffolding (Hard), Pedagogy IF (Hard)
| Rank | Subject | Overall Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | LearnLM-1.5-Pro | 0.6633 | — | Imported | 2026-05-27 |
| 2 | GPT-4o | 0.6378 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 3 | LLaMA3.1-70B-Instruct | 0.5522 | Llama 3.1 70B Instruct (`meta-llama-llama-3.1-70b-instruct`) | Imported | 2026-05-27 |
| 4 | LLaMA3.1-8B-Instruct | 0.4700 | Llama 3.1 8B Instruct (`meta-llama-llama-3.1-8b-instruct`) | Imported | 2026-05-27 |
| 5 | LLaMA3.2-3B-Instruct | 0.4689 | Llama 3.2 3B Instruct (`meta-llama-llama-3.2-3b-instruct`) | Imported | 2026-05-27 |
| 6 | Llemma-7B-ScienceTutor | 0.4078 | — | Imported | 2026-05-27 |
| 7 | Qwen2.5-7B-SocraticLM | 0.3400 | — | Imported | 2026-05-27 |
| 8 | Qwen2.5-Math-7B-Instruct | 0.3167 | — | Imported | 2026-05-27 |