K-12EduBench
K-12 education benchmark for subject knowledge, problem solving, and educational-goal cognition across school-level tasks.
Rows: 23
Primary metric: lambda_js_avg
Sampled: 2026-05-27
Metadata
Metrics: Lambda_js Avg, Accuracy Avg
| Rank | Subject | Lambda_js Avg | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Doubao-Pro-32K | 81.67 | — | Imported | 2026-05-27 |
| 2 | DeepSeek-V3 | 79.67 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-27 |
| 3 | Doubao-Lite-32K | 74.22 | — | Imported | 2026-05-27 |
| 4 | GeneralV3.5 | 72.83 | — | Imported | 2026-05-27 |
| 5 | ERNIE-Bot | 71.71 | — | Imported | 2026-05-27 |
| 6 | EduChat-R1-32B | 71.41 | — | Imported | 2026-05-27 |
| 7 | Baichuan4-Air | 71.28 | — | Imported | 2026-05-27 |
| 8 | GLM-4-AirX | 70.28 | — | Imported | 2026-05-27 |
| 9 | Yi-Lightning | 69.87 | — | Imported | 2026-05-27 |
| 10 | DeepSeek-R1 | 69.13 | R1 deepseek-r1 | Imported | 2026-05-27 |
| 11 | Hunyuan-Standard | 68.28 | — | Imported | 2026-05-27 |
| 12 | Gemini-1.5-Pro | 67.88 | — | Imported | 2026-05-27 |
| 13 | Qwen-Turbo | 65.31 | Qwen-Turbo qwen-qwen-turbo | Imported | 2026-05-27 |
| 14 | Grok-2 | 64.10 | — | Imported | 2026-05-27 |
| 15 | Gemini-2.0-Flash | 64.05 | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-05-27 |
| 16 | Grok-3 | 63.91 | Grok 3 xaigrok-3 | Imported | 2026-05-27 |
| 17 | Claude-3.7-Sonnet | 61.20 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-27 |
| 18 | GPT-4-Turbo | 55.94 | GPT-4 Turbo openai-gpt-4-turbo | Imported | 2026-05-27 |
| 19 | O1-Mini | 54.46 | — | Imported | 2026-05-27 |
| 20 | LLaMA-3.1-70B | 49.65 | — | Imported | 2026-05-27 |
| 21 | Claude-3.5-Haiku | 44.98 | Claude 3.5 Haiku anthropic-claude-3.5-haiku | Imported | 2026-05-27 |
| 22 | LLaMA-3.1-8B | 21.91 | — | Imported | 2026-05-27 |
| 23 | EduChat-SFT-13B | 19.64 | — | Imported | 2026-05-27 |