TruthfulQA

TruthfulQA is a benchmark that measures whether language models generate truthful answers to questions. It comprises 817 questions spanning 38 categories, including health, law, finance and politics. The questions are crafted so that some humans would answer falsely because of a false belief or misconception, which tests whether a model can avoid reproducing false answers learned from human-written text.

17 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Phi-3.5-MoE-instruct | 0.78 | | Self-reported | 2026-05-06
2 | Granite 3.3 8B Instruct | 0.67 | | Self-reported | 2026-05-06
3 | Phi 4 Mini | 0.66 | | Self-reported | 2026-05-06
4 | Phi-3.5-mini-instruct | 0.64 | | Self-reported | 2026-05-06
5 | Hermes 3 70B | 0.63 | | Self-reported | 2026-05-06
6 | Llama 3.1 Nemotron 70B Instruct | 0.59 | Llama 3.1 Nemotron 70B Instruct (nvidia-llama-3.1-nemotron-70b-instruct) | Self-reported | 2026-05-06
7 | Qwen2.5 14B Instruct | 0.58 | | Self-reported | 2026-05-06
8 | Jamba 1.5 Large | 0.58 | | Self-reported | 2026-05-06
9 | IBM Granite 4.0 Tiny Preview | 0.58 | | Self-reported | 2026-05-06
10 | Qwen2.5 32B Instruct | 0.58 | | Self-reported | 2026-05-06
11 | Command R+ | 0.56 | Command R (08-2024) (cohere-command-r-08-2024) | Self-reported | 2026-05-06
12 | Qwen2 72B Instruct | 0.55 | | Self-reported | 2026-05-06
13 | Qwen2.5-Coder 32B Instruct | 0.54 | Qwen2.5 Coder 32B Instruct (qwen-qwen-2.5-coder-32b-instruct) | Self-reported | 2026-05-06
14 | Jamba 1.5 Mini | 0.54 | | Self-reported | 2026-05-06
15 | Granite 3.3 8B Base | 0.52 | | Self-reported | 2026-05-06
16 | Qwen2.5-Coder 7B Instruct | 0.51 | | Self-reported | 2026-05-06
17 | Mistral NeMo Instruct | 0.50 | Mistral: Mistral Nemo (mistralai-mistral-nemo) | Self-reported | 2026-05-06