CommonsenseQA

CommonsenseQA: A multiple-choice question-answering benchmark that evaluates commonsense knowledge and reasoning in language models.

Rows: 13
Primary metric: random_split_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Random split accuracy, Random split sanity, Question concept split accuracy, Question concept split sanity

Latest Results

Rows are transcribed from Table 5 of the public CommonsenseQA paper. The primary score is test accuracy on the random split.

Rank  Subject                 Random split accuracy  Model Match  Provenance  Sampled
1     BERT-Large              55.9%                               Imported    2026-05-27
2     GPT                     45.5%                               Imported    2026-05-27
3     ESIM+ELMo               34.1%                               Imported    2026-05-27
4     ESIM+GloVe              32.8%                               Imported    2026-05-27
5     QABilinear+GloVe        31.5%                               Imported    2026-05-27
6     ESIM+Numberbatch        30.1%                               Imported    2026-05-27
7     VecSim+Numberbatch      29.1%                               Imported    2026-05-27
8     QABilinear+Numberbatch  28.8%                               Imported    2026-05-27
9     LM1B-Rep                26.1%                               Imported    2026-05-27
10    QACompare+GloVe         25.7%                               Imported    2026-05-27
11    LM1B-Concat             25.3%                               Imported    2026-05-27
12    VecSim+GloVe            22.3%                               Imported    2026-05-27
13    QACompare+Numberbatch   20.4%                               Imported    2026-05-27
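
The ranking above follows directly from sorting the rows by the primary metric. The following is a minimal Python sketch of that ordering, assuming the rows are held as (subject, accuracy) pairs; the names ROWS and rank_by_accuracy are hypothetical and are not part of the leaderboard's own tooling.

```python
# Rows copied from the table above: (subject, random-split accuracy in percent).
ROWS = [
    ("BERT-Large", 55.9),
    ("GPT", 45.5),
    ("ESIM+ELMo", 34.1),
    ("ESIM+GloVe", 32.8),
    ("QABilinear+GloVe", 31.5),
    ("ESIM+Numberbatch", 30.1),
    ("VecSim+Numberbatch", 29.1),
    ("QABilinear+Numberbatch", 28.8),
    ("LM1B-Rep", 26.1),
    ("QACompare+GloVe", 25.7),
    ("LM1B-Concat", 25.3),
    ("VecSim+GloVe", 22.3),
    ("QACompare+Numberbatch", 20.4),
]


def rank_by_accuracy(rows):
    """Return (rank, subject, accuracy) tuples, highest accuracy first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, subject, acc) for i, (subject, acc) in enumerate(ordered, start=1)]


ranked = rank_by_accuracy(ROWS)
for rank, subject, acc in ranked:
    print(f"{rank:>2}  {subject:<24} {acc:.1f}%")
```

Because Python's sort is stable, subjects with equal accuracy would keep their input order; how the leaderboard itself breaks ties is not stated in the source.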