AGIEval
A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
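As a rough illustration of how a single score could be derived from a mix of multiple-choice and cloze tasks, the sketch below computes per-task accuracy and exact match, then takes an unweighted mean across tasks. This is an assumption for illustration only: the scores listed on this page are self-reported, and the page does not say how they were computed or aggregated. The function names, the answer normalization rules, and the task names in the usage note are all hypothetical.

```python
def _fraction_correct(predictions, answers, normalize):
    """Share of items whose normalized prediction equals the normalized answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    hits = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return hits / len(answers)


def multiple_choice_accuracy(predictions, answers):
    """Accuracy over option letters, ignoring case and surrounding whitespace."""
    return _fraction_correct(predictions, answers, lambda s: s.strip().upper())


def cloze_exact_match(predictions, answers):
    """Exact match after collapsing whitespace and lowercasing (assumed rule)."""
    return _fraction_correct(
        predictions, answers, lambda s: " ".join(s.split()).lower()
    )


def macro_average(task_scores):
    """Unweighted mean of per-task scores (assumed aggregation)."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores.values()) / len(task_scores)
```

A call might look like `macro_average({"mc_task": multiple_choice_accuracy(preds, keys), "cloze_task": cloze_exact_match(cloze_preds, cloze_keys)})`, where both task names are placeholders. An unweighted mean treats every task equally regardless of its size; weighting by item count is an equally plausible choice.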
10 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Mistral Small 3 24B Base | 0.66 | — | Self-reported | 2026-05-06 |
| 2 | Ministral 3 (14B Base 2512) | 0.65 | — | Self-reported | 2026-05-06 |
| 3 | Ministral 3 (8B Base 2512) | 0.59 | — | Self-reported | 2026-05-06 |
| 4 | Hermes 3 70B | 0.56 | — | Self-reported | 2026-05-06 |
| 5 | Gemma 2 27B | 0.55 | Gemma 2 27B google-gemma-2-27b-it | Self-reported | 2026-05-06 |
| 6 | Gemma 2 9B | 0.53 | — | Self-reported | 2026-05-06 |
| 7 | Ministral 3 (3B Base 2512) | 0.51 | — | Self-reported | 2026-05-06 |
| 8 | Granite 3.3 8B Base | 0.49 | — | Self-reported | 2026-05-06 |
| 9 | Ministral 8B Instruct | 0.48 | — | Self-reported | 2026-05-06 |
| 10 | ERNIE 4.5 | 0.28 | ERNIE 4.5 300B A47B baidu-ernie-4.5-300b-a47b | Self-reported | 2026-05-06 |