TruthfulQA | BenchmarkList

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Phi-3.5-MoE-instruct	0.78	—	Self-reported	2026-05-06
2	Granite 3.3 8B Instruct	0.67	—	Self-reported	2026-05-06
3	Phi 4 Mini	0.66	—	Self-reported	2026-05-06
4	Phi-3.5-mini-instruct	0.64	—	Self-reported	2026-05-06
5	Hermes 3 70B	0.63	—	Self-reported	2026-05-06
6	Llama 3.1 Nemotron 70B Instruct	0.59	Llama 3.1 Nemotron 70B Instruct nvidia-llama-3.1-nemotron-70b-instruct	Self-reported	2026-05-06
7	Qwen2.5 14B Instruct	0.58	—	Self-reported	2026-05-06
8	Jamba 1.5 Large	0.58	—	Self-reported	2026-05-06
9	IBM Granite 4.0 Tiny Preview	0.58	—	Self-reported	2026-05-06
10	Qwen2.5 32B Instruct	0.58	—	Self-reported	2026-05-06
11	Command R+	0.56	Command R (08-2024) cohere-command-r-08-2024	Self-reported	2026-05-06
12	Qwen2 72B Instruct	0.55	—	Self-reported	2026-05-06
13	Qwen2.5-Coder 32B Instruct	0.54	Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct	Self-reported	2026-05-06
14	Jamba 1.5 Mini	0.54	—	Self-reported	2026-05-06
15	Granite 3.3 8B Base	0.52	—	Self-reported	2026-05-06
16	Qwen2.5-Coder 7B Instruct	0.51	—	Self-reported	2026-05-06
17	Mistral NeMo Instruct	0.50	Mistral: Mistral Nemo mistralai-mistral-nemo	Self-reported	2026-05-06