TutorBench
TutorBench evaluates how well LLMs perform common tutoring tasks for high school and AP-level subjects.
Rows: 27
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Confidence Interval Upper, Max Score
| Rank | Model | Score | Matched Model (ID) | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Muse Spark | 68.55 | — | Imported | 2026-05-06 |
| 1 | gpt-5.4-pro-2026-03-05 | 56.62 | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Imported | 2026-05-06 |
| 1 | gemini-2.5-pro-preview-06-05 | 55.65 | Gemini 2.5 Pro Preview 06-05 (`google-gemini-2.5-pro-preview`) | Imported | 2026-05-06 |
| 1 | gpt-5-2025-08-07 | 55.33 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 1 | o3-pro-2025-06-10 | 54.62 | o3 Pro (`openai-o3-pro`) | Imported | 2026-05-06 |
| 1 | kimi-k2.5 | 54.56 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-06 |
| 2 | gpt-5.1-thinking | 54.09 | GPT-5.1 (`openai-gpt-5.1`) | Imported | 2026-05-06 |
| 2 | claude-opus-4-6-thinking-max | 53.68 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 2 | gemini-3-pro-preview | 53.67 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-06 |
| 2 | claude-opus-4-6 (Non-Thinking) | 53.55 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 2 | gpt-5.2-2025-12-11 | 53.49 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-06 |
| 4 | gemini-3.1-pro-preview | 52.99 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-06 |
| 4 | o3-2025-04-16-medium | 52.76 | — | Imported | 2026-05-06 |
| 6 | o3-2025-04-16-high | 52.09 | — | Imported | 2026-05-06 |
| 9 | gemini-3.1-flash-lite-preview | 51.50 | Gemini 3.1 Flash Lite Preview (`google-gemini-3.1-flash-lite-preview`) | Imported | 2026-05-06 |
| 11 | claude-opus-4-5-20251101-thinking | 51.20 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 12 | claude-opus-4-1-20250805-thinking | 50.78 | Claude Opus 4.1 (`anthropic-claude-opus-4.1`) | Imported | 2026-05-06 |
| 14 | claude-opus-4-5-20251101 | 49.82 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 14 | claude-4-opus-20250514-thinking | 49.71 | — | Imported | 2026-05-06 |
| 16 | gpt-5.1-instant | 49.08 | GPT-5.1 Chat (`openai-gpt-5.1-chat`) | Imported | 2026-05-06 |
| 16 | claude-sonnet-4-5-20250929-thinking | 49.00 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 19 | claude-opus-4-1-20250805_anthropic | 47.40 | — | Imported | 2026-05-06 |
| 21 | claude-37-sonnet-thinking | 46.45 | Claude 3.7 Sonnet (thinking) (`anthropic-claude-3.7-sonnet-thinking`) | Imported | 2026-05-06 |
| 21 | claude-sonnet-4-5-20250929 | 45.70 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 21 | claude-opus-4-20250514 | 45.46 | Claude Opus 4 (`anthropic-claude-opus-4`) | Imported | 2026-05-06 |
| 25 | llama4-maverick | 40.20 | Llama 4 Maverick (`meta-llama-4-maverick`) | Imported | 2026-05-06 |
| 26 | gpt-4o | 36.12 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-06 |
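
The Rank column above repeats and skips values (1, 2, 4, 6, 9, ...), and the listed metrics include an upper confidence bound, but the page does not say how ranks are derived. One scheme that can produce ties and gaps like these, offered here purely as an assumption, ranks each entry as one plus the number of entries whose confidence interval lies entirely above its own. The sketch below illustrates that idea in Python. The `Entry` type, the `rank_by_interval` function, the lower bounds, and every number are hypothetical and are not taken from the leaderboard.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    score: float
    ci_lower: float  # hypothetical lower confidence bound
    ci_upper: float  # hypothetical upper confidence bound


def rank_by_interval(entries: list[Entry]) -> dict[str, int]:
    """Assumed scheme: rank = 1 + number of entries whose lower
    bound is strictly above this entry's upper bound."""
    ranks = {}
    for entry in entries:
        clearly_better = sum(
            1 for other in entries if other.ci_lower > entry.ci_upper
        )
        ranks[entry.name] = 1 + clearly_better
    return ranks


# Made-up values for illustration only; not leaderboard data.
entries = [
    Entry("model-a", 60.0, 58.0, 62.0),
    Entry("model-b", 59.0, 57.0, 61.0),
    Entry("model-c", 55.0, 53.0, 57.5),
    Entry("model-d", 50.0, 48.0, 52.0),
]

ranks = rank_by_interval(entries)
print(ranks)  # {'model-a': 1, 'model-b': 1, 'model-c': 2, 'model-d': 4}
```

In this toy data, `model-a` and `model-b` share rank 1 because their intervals overlap, and rank 3 is skipped because three entries sit clearly above `model-d`. Whether TutorBench uses this scheme is not confirmed by the page.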