AGIEval

A human-centric benchmark for evaluating foundation models on standardized exams, including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. It contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.

Rows: 10
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                      Score  Model Match                                      Provenance     Sampled
1     Mistral Small 3 24B Base     0.66   -                                                Self-reported  2026-05-06
2     Ministral 3 (14B Base 2512)  0.65   -                                                Self-reported  2026-05-06
3     Ministral 3 (8B Base 2512)   0.59   -                                                Self-reported  2026-05-06
4     Hermes 3 70B                 0.56   -                                                Self-reported  2026-05-06
5     Gemma 2 27B                  0.55   Gemma 2 27B (google-gemma-2-27b-it)              Self-reported  2026-05-06
6     Gemma 2 9B                   0.53   -                                                Self-reported  2026-05-06
7     Ministral 3 (3B Base 2512)   0.51   -                                                Self-reported  2026-05-06
8     Granite 3.3 8B Base          0.49   -                                                Self-reported  2026-05-06
9     Ministral 8B Instruct        0.48   -                                                Self-reported  2026-05-06
10    ERNIE 4.5                    0.28   ERNIE 4.5 300B A47B (baidu-ernie-4.5-300b-a47b)  Self-reported  2026-05-06