CommonsenseQA | BenchmarkList

Metadata

Metrics: Random split accuracy, Random split sanity, Question concept split accuracy, Question concept split sanity

Rank	Subject	Random split accuracy	Model Match	Provenance	Sampled
1	BERT-Large	55.9%	—	Imported	2026-05-27
2	GPT	45.5%	—	Imported	2026-05-27
3	ESIM+ELMo	34.1%	—	Imported	2026-05-27
4	ESIM+GloVe	32.8%	—	Imported	2026-05-27
5	QABilinear+GloVe	31.5%	—	Imported	2026-05-27
6	ESIM+Numberbatch	30.1%	—	Imported	2026-05-27
7	VecSim+Numberbatch	29.1%	—	Imported	2026-05-27
8	QABilinear+Numberbatch	28.8%	—	Imported	2026-05-27
9	LM1B-Rep	26.1%	—	Imported	2026-05-27
10	QACompare+GloVe	25.7%	—	Imported	2026-05-27
11	LM1B-Concat	25.3%	—	Imported	2026-05-27
12	VecSim+GloVe	22.3%	—	Imported	2026-05-27
13	QACompare+Numberbatch	20.4%	—	Imported	2026-05-27