Hallucinations Leaderboard
Public leaderboard evaluating LLMs on factuality, faithfulness, hallucination detection, instruction following, QA, reading comprehension, and summarization tasks.
42 rows
Primary metric: average_task_score
Sampled: 2026-05-06
Metadata
Metrics
Average Task Score, Parsed Task Coverage, NQ Open/EM, TriviaQA/EM, TruthQA MC1/Acc, TruthQA MC2/Acc, TruthQA Gen/ROUGE, XSum/ROUGE, XSum/factKB, XSum/BERT-P, CNN-DM/ROUGE, CNN-DM/factKB, CNN-DM/BERT-P, RACE/Acc, SQuADv2/EM, MemoTrap/Acc, IFEval/Acc, FaithDial/Acc, HaluQA/Acc, HaluSumm/Acc, HaluDial/Acc, FEVER/Acc, TrueFalse/Acc, PopQA/EM, NQ-Swap/EM
| Rank | Subject | Average Task Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | upstage/llama-30b-instruct-2048 | 54.14 | — | Imported | 2026-05-06 |
| 2 | HuggingFaceH4/zephyr-7b-alpha | 52.08 | — | Imported | 2026-05-06 |
| 3 | mistralai/Mistral-7B-Instruct-v0.2 | 51.63 | — | Imported | 2026-05-06 |
| 4 | stabilityai/StableBeluga-13B | 51.17 | — | Imported | 2026-05-06 |
| 5 | mistralai/Mistral-7B-Instruct-v0.1 | 50.48 | Mistral: Mistral 7B Instruct v0.1 (mistralai-mistral-7b-instruct-v0.1) | Imported | 2026-05-06 |
| 6 | NousResearch/Llama-2-13b-hf | 49.78 | — | Imported | 2026-05-06 |
| 7 | HuggingFaceH4/zephyr-7b-beta | 49.62 | — | Imported | 2026-05-06 |
| 8 | h2oai/h2ogpt-4096-llama2-7b-chat | 47.81 | — | Imported | 2026-05-06 |
| 9 | NousResearch/Nous-Hermes-Llama2-13b | 47.05 | — | Imported | 2026-05-06 |
| 10 | HuggingFaceH4/mistral-7b-sft-beta | 46.80 | — | Imported | 2026-05-06 |
| 11 | NousResearch/Nous-Hermes-llama-2-7b | 46.65 | — | Imported | 2026-05-06 |
| 12 | NousResearch/Llama-2-7b-chat-hf | 46.54 | — | Imported | 2026-05-06 |
| 13 | meta-llama/Llama-2-13b-hf | 46.32 | — | Imported | 2026-05-06 |
| 14 | NousResearch/Yarn-Mistral-7b-128k | 46.29 | — | Imported | 2026-05-06 |
| 15 | h2oai/h2ogpt-4096-llama2-13b-chat | 46.05 | — | Imported | 2026-05-06 |
| 16 | stabilityai/StableBeluga-7B | 46.02 | — | Imported | 2026-05-06 |
| 17 | meta-llama/Llama-2-13b-chat-hf | 45.52 | — | Imported | 2026-05-06 |
| 18 | google/gemma-7b | 44.42 | — | Imported | 2026-05-06 |
| 19 | microsoft/Orca-2-13b | 44.31 | — | Imported | 2026-05-06 |
| 20 | meta-llama/Llama-2-7b-hf | 44.23 | — | Imported | 2026-05-06 |
| 21 | mistralai/Mistral-7B-v0.1 | 44.06 | — | Imported | 2026-05-06 |
| 22 | upstage/SOLAR-10.7B-Instruct-v1.0 | 43.71 | — | Imported | 2026-05-06 |
| 23 | meta-llama/Llama-2-7b-chat-hf | 43.03 | — | Imported | 2026-05-06 |
| 24 | EleutherAI/llemma_7b | 41.90 | — | Imported | 2026-05-06 |
| 25 | NousResearch/Llama-2-7b-hf | 41.03 | — | Imported | 2026-05-06 |
| 26 | tiiuae/falcon-7b-instruct | 40.57 | — | Imported | 2026-05-06 |
| 27 | tiiuae/falcon-rw-1b | 40.26 | — | Imported | 2026-05-06 |
| 28 | bigscience/bloomz-3b | 39.26 | — | Imported | 2026-05-06 |
| 29 | upstage/SOLAR-10.7B-v1.0 | 39.20 | — | Imported | 2026-05-06 |
| 30 | google/gemma-2b | 39.09 | — | Imported | 2026-05-06 |
| 31 | bigscience/bloomz-7b1 | 38.75 | — | Imported | 2026-05-06 |
| 32 | bigscience/bloom-1b7 | 38.24 | — | Imported | 2026-05-06 |
| 33 | EleutherAI/gpt-neo-1.3B | 38.20 | — | Imported | 2026-05-06 |
| 34 | bigscience/bloom-3b | 37.51 | — | Imported | 2026-05-06 |
| 35 | bigscience/bloom-560m | 37.43 | — | Imported | 2026-05-06 |
| 36 | tiiuae/falcon-7b | 36.97 | — | Imported | 2026-05-06 |
| 37 | bigscience/bloomz-560m | 36.76 | — | Imported | 2026-05-06 |
| 38 | bigscience/bloom-1b1 | 36.47 | — | Imported | 2026-05-06 |
| 39 | EleutherAI/gpt-neo-125m | 36.12 | — | Imported | 2026-05-06 |
| 40 | EleutherAI/gpt-neo-2.7B | 35.82 | — | Imported | 2026-05-06 |
| 41 | EleutherAI/gpt-j-6b | 35.30 | — | Imported | 2026-05-06 |
| 42 | bigscience/bloom-7b1 | 34.25 | — | Imported | 2026-05-06 |