EduBench

Education benchmark spanning student and teacher scenarios, educational tasks, and multidimensional educational response quality.

Rows: 50
Primary metric: Average
Sampled: 2026-05-27

Metadata

Metrics

Average, Q&A, PLS, BFA, CSI

Latest Results

Rows are parsed from the scenario-level and metric-dimension EduBench README tables. Subject IDs include the evaluator because the source reports multiple evaluator-model combinations.
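Because each subject ID bundles the model name, the source table (scenario or dimension), and the evaluator, it can help to split a row back into those fields before analysis. The following is a minimal sketch that assumes the flattened row layout shown in the table below (rank, subject, average, provenance, sampled date). The names `Row`, `ROW_RE`, and `parse_row` are hypothetical and are not part of EduBench or of the importer that produced this page.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row, with the subject ID split into its parts (hypothetical type)."""
    rank: int
    model: str
    table: str       # "scenario" or "dimension"
    evaluator: str
    average: float
    provenance: str
    sampled: str     # ISO date string, kept as text


# Assumed layout: "<rank> <model> (<table>, evaluator: <evaluator>) <average> <provenance> <date>"
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+"
    r"(?P<model>.+?)\s+"
    r"\((?P<table>scenario|dimension), evaluator: (?P<evaluator>[^)]+)\)\s+"
    r"(?P<average>\d+(?:\.\d+)?)\s+"
    r"(?P<provenance>\S+)\s+"
    r"(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Row:
    """Parse one flattened table row; raise ValueError if it does not match the assumed layout."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return Row(
        rank=int(m["rank"]),
        model=m["model"],
        table=m["table"],
        evaluator=m["evaluator"],
        average=float(m["average"]),
        provenance=m["provenance"],
        sampled=m["sampled"],
    )


row = parse_row("1 DeepSeek R1 (scenario, evaluator: QwQ-Plus) 9.49 Imported 2026-05-27")
print(row.model, row.table, row.evaluator, row.average)
```

Keeping the evaluator as its own field makes it straightforward to filter or group rows by judge rather than treating every model-evaluator pair as a separate subject.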

Rank Subject Average Model Match Provenance Sampled
1 DeepSeek R1 (scenario, evaluator: QwQ-Plus) 9.49 Imported 2026-05-27
2 DeepSeek R1 (dimension, evaluator: QwQ-Plus) 9.37 Imported 2026-05-27
3 DeepSeek R1 (scenario, evaluator: DeepSeek R1) 9.29 Imported 2026-05-27
4 Qwen Max (scenario, evaluator: QwQ-Plus) 9.18 Imported 2026-05-27
5 DeepSeek R1 (scenario, evaluator: DeepSeek V3) 9.14 Imported 2026-05-27
6 DeepSeek R1 (dimension, evaluator: DeepSeek R1) 9.12 Imported 2026-05-27
7 DeepSeek R1 (scenario, evaluator: GPT-4o) 9.06 Imported 2026-05-27
8 DeepSeek V3 (scenario, evaluator: QwQ-Plus) 9.06 Imported 2026-05-27
9 DeepSeek V3 (scenario, evaluator: DeepSeek R1) 9.05 Imported 2026-05-27
10 Qwen Max (dimension, evaluator: QwQ-Plus) 9.03 Imported 2026-05-27
11 DeepSeek V3 (scenario, evaluator: GPT-4o) 8.99 Imported 2026-05-27
12 Qwen Max (scenario, evaluator: GPT-4o) 8.99 Imported 2026-05-27
13 DeepSeek R1 (dimension, evaluator: GPT-4o) 8.98 Imported 2026-05-27
14 Qwen Max (scenario, evaluator: DeepSeek R1) 8.96 Imported 2026-05-27
15 Qwen2.5-14B-Instruct (scenario, evaluator: QwQ-Plus) 8.94 Imported 2026-05-27
16 DeepSeek R1 (dimension, evaluator: DeepSeek V3) 8.93 Imported 2026-05-27
17 DeepSeek V3 (dimension, evaluator: GPT-4o) 8.93 Imported 2026-05-27
18 DeepSeek V3 (dimension, evaluator: QwQ-Plus) 8.93 Imported 2026-05-27
19 DeepSeek V3 (dimension, evaluator: DeepSeek R1) 8.91 Imported 2026-05-27
20 Qwen Max (dimension, evaluator: GPT-4o) 8.9 Imported 2026-05-27
21 Qwen2.5-14B-Instruct (scenario, evaluator: GPT-4o) 8.87 Imported 2026-05-27
22 Qwen2.5-7B-Instruct (scenario, evaluator: GPT-4o) 8.87 Imported 2026-05-27
23 DeepSeek V3 (scenario, evaluator: DeepSeek V3) 8.86 Imported 2026-05-27
24 Qwen Max (dimension, evaluator: DeepSeek R1) 8.84 Imported 2026-05-27
25 Qwen Max (scenario, evaluator: DeepSeek V3) 8.83 Imported 2026-05-27
26 Qwen2.5-7B-Instruct (scenario, evaluator: QwQ-Plus) 8.78 Imported 2026-05-27
27 Qwen2.5-14B-Instruct (dimension, evaluator: QwQ-Plus) 8.78 Imported 2026-05-27
28 Qwen2.5-14B-Instruct (dimension, evaluator: GPT-4o) 8.77 Imported 2026-05-27
29 Qwen2.5-7B-Instruct (dimension, evaluator: GPT-4o) 8.77 Imported 2026-05-27
30 DeepSeek R1 (dimension, evaluator: Human) 8.74 Imported 2026-05-27
31 DeepSeek R1 (scenario, evaluator: Human) 8.71 Imported 2026-05-27
32 DeepSeek V3 (dimension, evaluator: DeepSeek V3) 8.66 Imported 2026-05-27
33 Qwen Max (dimension, evaluator: DeepSeek V3) 8.66 Imported 2026-05-27
34 Qwen2.5-7B-Instruct (dimension, evaluator: QwQ-Plus) 8.66 Imported 2026-05-27
35 Qwen2.5-7B-Instruct (scenario, evaluator: DeepSeek V3) 8.65 Imported 2026-05-27
36 Qwen2.5-14B-Instruct (scenario, evaluator: DeepSeek V3) 8.62 Imported 2026-05-27
37 Qwen2.5-14B-Instruct (scenario, evaluator: DeepSeek R1) 8.58 Imported 2026-05-27
38 Qwen2.5-7B-Instruct (scenario, evaluator: DeepSeek R1) 8.46 Imported 2026-05-27
39 Qwen2.5-14B-Instruct (dimension, evaluator: DeepSeek R1) 8.46 Imported 2026-05-27
40 Qwen2.5-14B-Instruct (dimension, evaluator: DeepSeek V3) 8.46 Imported 2026-05-27
41 Qwen2.5-7B-Instruct (dimension, evaluator: DeepSeek V3) 8.44 Imported 2026-05-27
42 Qwen2.5-7B-Instruct (dimension, evaluator: DeepSeek R1) 8.36 Imported 2026-05-27
43 Qwen Max (scenario, evaluator: Human) 8.06 Imported 2026-05-27
44 Qwen Max (dimension, evaluator: Human) 8.02 Imported 2026-05-27
45 DeepSeek V3 (dimension, evaluator: Human) 7.89 Imported 2026-05-27
46 DeepSeek V3 (scenario, evaluator: Human) 7.82 Imported 2026-05-27
47 Qwen2.5-14B-Instruct (scenario, evaluator: Human) 7.61 Imported 2026-05-27
48 Qwen2.5-14B-Instruct (dimension, evaluator: Human) 7.56 Imported 2026-05-27
49 Qwen2.5-7B-Instruct (scenario, evaluator: Human) 7.5 Imported 2026-05-27
50 Qwen2.5-7B-Instruct (dimension, evaluator: Human) 7.46 Imported 2026-05-27
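
Since the ranking above interleaves scores from different evaluators, one way to read it is to compare models only within a single evaluator and table type. The sketch below does that for a handful of rows copied from the table; the function name `rank_within_evaluator` and the tuple layout are assumptions made for illustration, not something the source defines.

```python
from collections import defaultdict

# (model, table, evaluator, average) tuples copied from a subset of the rows above.
ROWS = [
    ("DeepSeek R1", "scenario", "QwQ-Plus", 9.49),
    ("DeepSeek R1", "scenario", "DeepSeek R1", 9.29),
    ("DeepSeek R1", "scenario", "DeepSeek V3", 9.14),
    ("DeepSeek R1", "scenario", "GPT-4o", 9.06),
    ("DeepSeek R1", "scenario", "Human", 8.71),
    ("Qwen2.5-7B-Instruct", "scenario", "GPT-4o", 8.87),
    ("Qwen2.5-7B-Instruct", "scenario", "QwQ-Plus", 8.78),
    ("Qwen2.5-7B-Instruct", "scenario", "DeepSeek V3", 8.65),
    ("Qwen2.5-7B-Instruct", "scenario", "DeepSeek R1", 8.46),
    ("Qwen2.5-7B-Instruct", "scenario", "Human", 7.5),
]


def rank_within_evaluator(rows, table):
    """Group rows of one table type by evaluator and order models by descending average."""
    by_evaluator = defaultdict(list)
    for model, kind, evaluator, average in rows:
        if kind == table:
            by_evaluator[evaluator].append((average, model))
    return {
        evaluator: [model for _, model in sorted(pairs, reverse=True)]
        for evaluator, pairs in by_evaluator.items()
    }


for evaluator, models in rank_within_evaluator(ROWS, "scenario").items():
    print(f"{evaluator}: {models}")
```

Grouping this way avoids comparing, say, a score assigned by QwQ-Plus against one assigned by a human rater, which the flat ranking does implicitly.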