TutorBench

TutorBench evaluates how well LLMs perform common tutoring tasks for high school and AP-level subjects.

Rows: 27
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Confidence Interval Upper, Max Score

Latest Results

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Muse Spark | 68.55 | | Imported | 2026-05-06
1 | gpt-5.4-pro-2026-03-05 | 56.62 | GPT-5.4 Pro (openai-gpt-5.4-pro) | Imported | 2026-05-06
1 | gemini-2.5-pro-preview-06-05 | 55.65 | Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview) | Imported | 2026-05-06
1 | gpt-5-2025-08-07 | 55.33 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06
1 | o3-pro-2025-06-10 | 54.62 | o3 Pro (openai-o3-pro) | Imported | 2026-05-06
1 | kimi-k2.5 | 54.56 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-06
2 | gpt-5.1-thinking | 54.09 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-06
2 | claude-opus-4-6-thinking-max | 53.68 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06
2 | gemini-3-pro-preview | 53.67 | Gemini 3 (google-gemini-3) | Imported | 2026-05-06
2 | claude-opus-4-6 (Non-Thinking) | 53.55 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06
2 | gpt-5.2-2025-12-11 | 53.49 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
4 | gemini-3.1-pro-preview | 52.99 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-06
4 | o3-2025-04-16-medium | 52.76 | | Imported | 2026-05-06
6 | o3-2025-04-16-high | 52.09 | | Imported | 2026-05-06
9 | gemini-3.1-flash-lite-preview | 51.50 | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-06
11 | claude-opus-4-5-20251101-thinking | 51.20 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06
12 | claude-opus-4-1-20250805-thinking | 50.78 | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-06
14 | claude-opus-4-5-20251101 | 49.82 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06
14 | claude-4-opus-20250514-thinking | 49.71 | | Imported | 2026-05-06
16 | gpt-5.1-instant | 49.08 | GPT-5.1 Chat (openai-gpt-5.1-chat) | Imported | 2026-05-06
16 | claude-sonnet-4-5-20250929-thinking | 49.00 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06
19 | claude-opus-4-1-20250805_anthropic | 47.40 | | Imported | 2026-05-06
21 | claude-37-sonnet-thinking | 46.45 | Claude 3.7 Sonnet (thinking) (anthropic-claude-3.7-sonnet-thinking) | Imported | 2026-05-06
21 | claude-sonnet-4-5-20250929 | 45.70 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06
21 | claude-opus-4-20250514 | 45.46 | Claude Opus 4 (anthropic-claude-opus-4) | Imported | 2026-05-06
25 | llama4-maverick | 40.20 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-06
26 | gpt-4o | 36.12 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06
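
Several subjects in the results resolve to the same model match (for example, thinking and non-thinking variants of one model). As a rough illustration of how such rows might be compared, the Python sketch below groups a handful of rows by match slug and reports the score spread within each group. The tuple layout, the `ROWS` constant, and the `score_spread_by_match` helper are hypothetical names chosen for this example and are not part of TutorBench or any published tooling; the scores are copied from the results above.

```python
from collections import defaultdict

# (subject, score, match_slug) copied from the results listing.
# Only subjects that share a model match with another subject are included.
ROWS = [
    ("claude-opus-4-6-thinking-max", 53.68, "anthropic-claude-opus-4.6"),
    ("claude-opus-4-6 (Non-Thinking)", 53.55, "anthropic-claude-opus-4.6"),
    ("claude-opus-4-5-20251101-thinking", 51.20, "anthropic-claude-opus-4.5"),
    ("claude-opus-4-5-20251101", 49.82, "anthropic-claude-opus-4.5"),
    ("claude-sonnet-4-5-20250929-thinking", 49.00, "anthropic-claude-sonnet-4.5"),
    ("claude-sonnet-4-5-20250929", 45.70, "anthropic-claude-sonnet-4.5"),
]


def score_spread_by_match(rows):
    """Return {match_slug: max score - min score} for each matched model.

    Hypothetical helper: it only summarizes the rows it is given and makes
    no claim about how the leaderboard itself computes ranks.
    """
    grouped = defaultdict(list)
    for _subject, score, slug in rows:
        grouped[slug].append(score)
    # Round to two decimals to match the precision of the listed scores.
    return {slug: round(max(scores) - min(scores), 2)
            for slug, scores in grouped.items()}


if __name__ == "__main__":
    for slug, spread in score_spread_by_match(ROWS).items():
        print(f"{slug}: spread {spread:.2f}")
```

The spread is a plain difference of listed scores. It ignores the confidence interval metric, so small gaps should not be read as meaningful differences between variants.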