TruthfulQA
TruthfulQA is a benchmark that measures whether language models generate truthful answers to questions. It comprises 817 questions spanning 38 categories, including health, law, finance, and politics. The questions are crafted so that some humans would answer them falsely because of a false belief or misconception, which tests whether a model can avoid reproducing falsehoods it has learned from human-written text.
Rows: 17
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | 0.78 | — | Self-reported | 2026-05-06 |
| 2 | Granite 3.3 8B Instruct | 0.67 | — | Self-reported | 2026-05-06 |
| 3 | Phi 4 Mini | 0.66 | — | Self-reported | 2026-05-06 |
| 4 | Phi-3.5-mini-instruct | 0.64 | — | Self-reported | 2026-05-06 |
| 5 | Hermes 3 70B | 0.63 | — | Self-reported | 2026-05-06 |
| 6 | Llama 3.1 Nemotron 70B Instruct | 0.59 | Llama 3.1 Nemotron 70B Instruct (`nvidia-llama-3.1-nemotron-70b-instruct`) | Self-reported | 2026-05-06 |
| 7 | Qwen2.5 14B Instruct | 0.58 | — | Self-reported | 2026-05-06 |
| 8 | Jamba 1.5 Large | 0.58 | — | Self-reported | 2026-05-06 |
| 9 | IBM Granite 4.0 Tiny Preview | 0.58 | — | Self-reported | 2026-05-06 |
| 10 | Qwen2.5 32B Instruct | 0.58 | — | Self-reported | 2026-05-06 |
| 11 | Command R+ | 0.56 | Command R (08-2024) (`cohere-command-r-08-2024`) | Self-reported | 2026-05-06 |
| 12 | Qwen2 72B Instruct | 0.55 | — | Self-reported | 2026-05-06 |
| 13 | Qwen2.5-Coder 32B Instruct | 0.54 | Qwen2.5 Coder 32B Instruct (`qwen-qwen-2.5-coder-32b-instruct`) | Self-reported | 2026-05-06 |
| 14 | Jamba 1.5 Mini | 0.54 | — | Self-reported | 2026-05-06 |
| 15 | Granite 3.3 8B Base | 0.52 | — | Self-reported | 2026-05-06 |
| 16 | Qwen2.5-Coder 7B Instruct | 0.51 | — | Self-reported | 2026-05-06 |
| 17 | Mistral NeMo Instruct | 0.50 | Mistral: Mistral Nemo (`mistralai-mistral-nemo`) | Self-reported | 2026-05-06 |