EduBench
Education benchmark spanning student and teacher scenarios, educational tasks, and multidimensional educational response quality.
Rows: 50
Primary metric: Average
Sampled: 2026-05-27
Metadata
Metrics: Average, Q&A, PLS, BFA, CSI
| Rank | Subject | Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | DeepSeek R1 (scenario, evaluator: QwQ-Plus) | 9.49 | — | Imported | 2026-05-27 |
| 2 | DeepSeek R1 (dimension, evaluator: QwQ-Plus) | 9.37 | — | Imported | 2026-05-27 |
| 3 | DeepSeek R1 (scenario, evaluator: DeepSeek R1) | 9.29 | — | Imported | 2026-05-27 |
| 4 | Qwen Max (scenario, evaluator: QwQ-Plus) | 9.18 | — | Imported | 2026-05-27 |
| 5 | DeepSeek R1 (scenario, evaluator: DeepSeek V3) | 9.14 | — | Imported | 2026-05-27 |
| 6 | DeepSeek R1 (dimension, evaluator: DeepSeek R1) | 9.12 | — | Imported | 2026-05-27 |
| 7 | DeepSeek R1 (scenario, evaluator: GPT-4o) | 9.06 | — | Imported | 2026-05-27 |
| 8 | DeepSeek V3 (scenario, evaluator: QwQ-Plus) | 9.06 | — | Imported | 2026-05-27 |
| 9 | DeepSeek V3 (scenario, evaluator: DeepSeek R1) | 9.05 | — | Imported | 2026-05-27 |
| 10 | Qwen Max (dimension, evaluator: QwQ-Plus) | 9.03 | — | Imported | 2026-05-27 |
| 11 | DeepSeek V3 (scenario, evaluator: GPT-4o) | 8.99 | — | Imported | 2026-05-27 |
| 12 | Qwen Max (scenario, evaluator: GPT-4o) | 8.99 | — | Imported | 2026-05-27 |
| 13 | DeepSeek R1 (dimension, evaluator: GPT-4o) | 8.98 | — | Imported | 2026-05-27 |
| 14 | Qwen Max (scenario, evaluator: DeepSeek R1) | 8.96 | — | Imported | 2026-05-27 |
| 15 | Qwen2.5-14B-Instruct (scenario, evaluator: QwQ-Plus) | 8.94 | — | Imported | 2026-05-27 |
| 16 | DeepSeek R1 (dimension, evaluator: DeepSeek V3) | 8.93 | — | Imported | 2026-05-27 |
| 17 | DeepSeek V3 (dimension, evaluator: GPT-4o) | 8.93 | — | Imported | 2026-05-27 |
| 18 | DeepSeek V3 (dimension, evaluator: QwQ-Plus) | 8.93 | — | Imported | 2026-05-27 |
| 19 | DeepSeek V3 (dimension, evaluator: DeepSeek R1) | 8.91 | — | Imported | 2026-05-27 |
| 20 | Qwen Max (dimension, evaluator: GPT-4o) | 8.90 | — | Imported | 2026-05-27 |
| 21 | Qwen2.5-14B-Instruct (scenario, evaluator: GPT-4o) | 8.87 | — | Imported | 2026-05-27 |
| 22 | Qwen2.5-7B-Instruct (scenario, evaluator: GPT-4o) | 8.87 | — | Imported | 2026-05-27 |
| 23 | DeepSeek V3 (scenario, evaluator: DeepSeek V3) | 8.86 | — | Imported | 2026-05-27 |
| 24 | Qwen Max (dimension, evaluator: DeepSeek R1) | 8.84 | — | Imported | 2026-05-27 |
| 25 | Qwen Max (scenario, evaluator: DeepSeek V3) | 8.83 | — | Imported | 2026-05-27 |
| 26 | Qwen2.5-7B-Instruct (scenario, evaluator: QwQ-Plus) | 8.78 | — | Imported | 2026-05-27 |
| 27 | Qwen2.5-14B-Instruct (dimension, evaluator: QwQ-Plus) | 8.78 | — | Imported | 2026-05-27 |
| 28 | Qwen2.5-14B-Instruct (dimension, evaluator: GPT-4o) | 8.77 | — | Imported | 2026-05-27 |
| 29 | Qwen2.5-7B-Instruct (dimension, evaluator: GPT-4o) | 8.77 | — | Imported | 2026-05-27 |
| 30 | DeepSeek R1 (dimension, evaluator: Human) | 8.74 | — | Imported | 2026-05-27 |
| 31 | DeepSeek R1 (scenario, evaluator: Human) | 8.71 | — | Imported | 2026-05-27 |
| 32 | DeepSeek V3 (dimension, evaluator: DeepSeek V3) | 8.66 | — | Imported | 2026-05-27 |
| 33 | Qwen Max (dimension, evaluator: DeepSeek V3) | 8.66 | — | Imported | 2026-05-27 |
| 34 | Qwen2.5-7B-Instruct (dimension, evaluator: QwQ-Plus) | 8.66 | — | Imported | 2026-05-27 |
| 35 | Qwen2.5-7B-Instruct (scenario, evaluator: DeepSeek V3) | 8.65 | — | Imported | 2026-05-27 |
| 36 | Qwen2.5-14B-Instruct (scenario, evaluator: DeepSeek V3) | 8.62 | — | Imported | 2026-05-27 |
| 37 | Qwen2.5-14B-Instruct (scenario, evaluator: DeepSeek R1) | 8.58 | — | Imported | 2026-05-27 |
| 38 | Qwen2.5-7B-Instruct (scenario, evaluator: DeepSeek R1) | 8.46 | — | Imported | 2026-05-27 |
| 39 | Qwen2.5-14B-Instruct (dimension, evaluator: DeepSeek R1) | 8.46 | — | Imported | 2026-05-27 |
| 40 | Qwen2.5-14B-Instruct (dimension, evaluator: DeepSeek V3) | 8.46 | — | Imported | 2026-05-27 |
| 41 | Qwen2.5-7B-Instruct (dimension, evaluator: DeepSeek V3) | 8.44 | — | Imported | 2026-05-27 |
| 42 | Qwen2.5-7B-Instruct (dimension, evaluator: DeepSeek R1) | 8.36 | — | Imported | 2026-05-27 |
| 43 | Qwen Max (scenario, evaluator: Human) | 8.06 | — | Imported | 2026-05-27 |
| 44 | Qwen Max (dimension, evaluator: Human) | 8.02 | — | Imported | 2026-05-27 |
| 45 | DeepSeek V3 (dimension, evaluator: Human) | 7.89 | — | Imported | 2026-05-27 |
| 46 | DeepSeek V3 (scenario, evaluator: Human) | 7.82 | — | Imported | 2026-05-27 |
| 47 | Qwen2.5-14B-Instruct (scenario, evaluator: Human) | 7.61 | — | Imported | 2026-05-27 |
| 48 | Qwen2.5-14B-Instruct (dimension, evaluator: Human) | 7.56 | — | Imported | 2026-05-27 |
| 49 | Qwen2.5-7B-Instruct (scenario, evaluator: Human) | 7.50 | — | Imported | 2026-05-27 |
| 50 | Qwen2.5-7B-Instruct (dimension, evaluator: Human) | 7.46 | — | Imported | 2026-05-27 |
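
Each Subject label in the table packs three fields into one string: the evaluated model, the aggregation view (scenario or dimension), and the evaluator. The sketch below shows one way to split those fields apart and compare evaluators for a single model. It is only an illustration: the regular expression assumes every label follows the `Model (view, evaluator: Name)` pattern seen above, the function names `parse_subject` and `scores_by_evaluator` are hypothetical, and the sample rows are copied from the table rather than loaded from any official export.

```python
import re

# A few rows copied verbatim from the table above: (subject label, average).
ROWS = [
    ("DeepSeek R1 (scenario, evaluator: QwQ-Plus)", 9.49),
    ("DeepSeek R1 (scenario, evaluator: DeepSeek R1)", 9.29),
    ("DeepSeek R1 (scenario, evaluator: DeepSeek V3)", 9.14),
    ("DeepSeek R1 (scenario, evaluator: GPT-4o)", 9.06),
    ("DeepSeek R1 (scenario, evaluator: Human)", 8.71),
    ("Qwen Max (scenario, evaluator: Human)", 8.06),
]

# Assumed label layout: "<model> (<view>, evaluator: <evaluator>)".
SUBJECT_RE = re.compile(
    r"^(?P<model>.+?) \((?P<view>\w+), evaluator: (?P<evaluator>.+)\)$"
)


def parse_subject(label):
    """Split a Subject label into (model, view, evaluator)."""
    match = SUBJECT_RE.match(label)
    if match is None:
        raise ValueError(f"unexpected subject label: {label!r}")
    return match.group("model"), match.group("view"), match.group("evaluator")


def scores_by_evaluator(rows, model, view):
    """Map evaluator name to average score for one model and view."""
    result = {}
    for label, score in rows:
        row_model, row_view, evaluator = parse_subject(label)
        if row_model == model and row_view == view:
            result[evaluator] = score
    return result


if __name__ == "__main__":
    scores = scores_by_evaluator(ROWS, "DeepSeek R1", "scenario")
    human = scores["Human"]
    for evaluator, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        # Difference between this evaluator's score and the human score.
        print(f"{evaluator:12s} {score:.2f} (vs. Human: {score - human:+.2f})")
```

Keeping the evaluator as its own field makes it straightforward to compare rows that share a model and view, which a single ranking by Average does not show directly.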