K-12EduBench

K-12 education benchmark for subject knowledge, problem solving, and educational-goal cognition across school-level tasks.

23 rows
lambda_js_avg primary metric
2026-05-27 sampled

Metadata

Metrics

Lambda_js Avg, Accuracy Avg

Latest Results

Rows parsed from the public K-12EduBench README. Primary score is the reported composite Lambda_js average; answer-accuracy average is joined from the README ACC table where available.

| Rank | Subject | Lambda_js Avg | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Doubao-Pro-32K | 81.67 | | Imported | 2026-05-27 |
| 2 | DeepSeek-V3 | 79.67 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-27 |
| 3 | Doubao-Lite-32K | 74.22 | | Imported | 2026-05-27 |
| 4 | GeneralV3.5 | 72.83 | | Imported | 2026-05-27 |
| 5 | ERNIE-Bot | 71.71 | | Imported | 2026-05-27 |
| 6 | EduChat-R1-32B | 71.41 | | Imported | 2026-05-27 |
| 7 | Baichuan4-Air | 71.28 | | Imported | 2026-05-27 |
| 8 | GLM-4-AirX | 70.28 | | Imported | 2026-05-27 |
| 9 | Yi-Lightning | 69.87 | | Imported | 2026-05-27 |
| 10 | DeepSeek-R1 | 69.13 | R1 (deepseek-r1) | Imported | 2026-05-27 |
| 11 | Hunyuan-Standard | 68.28 | | Imported | 2026-05-27 |
| 12 | Gemini-1.5-Pro | 67.88 | | Imported | 2026-05-27 |
| 13 | Qwen-Turbo | 65.31 | Qwen-Turbo (qwen-qwen-turbo) | Imported | 2026-05-27 |
| 14 | Grok-2 | 64.10 | | Imported | 2026-05-27 |
| 15 | Gemini-2.0-Flash | 64.05 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-27 |
| 16 | Grok-3 | 63.91 | Grok 3 (xaigrok-3) | Imported | 2026-05-27 |
| 17 | Claude-3.7-Sonnet | 61.20 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-27 |
| 18 | GPT-4-Turbo | 55.94 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-27 |
| 19 | O1-Mini | 54.46 | | Imported | 2026-05-27 |
| 20 | LLaMA-3.1-70B | 49.65 | | Imported | 2026-05-27 |
| 21 | Claude-3.5-Haiku | 44.98 | Claude 3.5 Haiku (anthropic-claude-3.5-haiku) | Imported | 2026-05-27 |
| 22 | LLaMA-3.1-8B | 21.91 | | Imported | 2026-05-27 |
| 23 | EduChat-SFT-13B | 19.64 | | Imported | 2026-05-27 |
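
The note under Latest Results says that rows are ranked by the Lambda_js average and that the accuracy average is joined in "where available". A minimal sketch of that kind of ranking plus left join might look like the following. The function name `build_leaderboard`, the field names, and the toy model names and scores are all hypothetical, chosen for illustration only; they are not taken from the K-12EduBench README or from the importer that produced this page.

```python
from typing import Optional


def build_leaderboard(
    lambda_rows: list[tuple[str, float]],
    acc_by_model: dict[str, float],
) -> list[dict[str, object]]:
    """Rank subjects by Lambda_js average and attach accuracy where available.

    This is a left join: every row in lambda_rows is kept, and
    accuracy_avg is None when the accuracy table has no entry for it.
    """
    # Highest Lambda_js average first, since it is the primary metric.
    ranked = sorted(lambda_rows, key=lambda row: row[1], reverse=True)
    board = []
    for rank, (subject, score) in enumerate(ranked, start=1):
        accuracy: Optional[float] = acc_by_model.get(subject)
        board.append(
            {
                "rank": rank,
                "subject": subject,
                "lambda_js_avg": score,
                "accuracy_avg": accuracy,
            }
        )
    return board


# Toy inputs (made-up names and numbers, not benchmark results).
toy_lambda_rows = [("model-b", 60.0), ("model-a", 70.0), ("model-c", 50.0)]
toy_acc_by_model = {"model-a": 55.0, "model-c": 40.0}  # no entry for model-b

board = build_leaderboard(toy_lambda_rows, toy_acc_by_model)
for row in board:
    print(row)
```

Keeping unmatched rows with a `None` accuracy, rather than dropping them, mirrors the "where available" wording: a missing accuracy entry does not remove a subject from the ranking.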