Hallucinations Leaderboard

Public leaderboard evaluating LLM factuality, faithfulness, hallucination detection, instruction following, QA, reading comprehension, and summarization tasks.

Rows: 42
Primary metric: average_task_score
Sampled: 2026-05-06

Metadata

Metrics

Average Task Score, Parsed Task Coverage, NQ Open/EM, TriviaQA/EM, TruthQA MC1/Acc, TruthQA MC2/Acc, TruthQA Gen/ROUGE, XSum/ROUGE, XSum/factKB, XSum/BERT-P, CNN-DM/ROUGE, CNN-DM/factKB, CNN-DM/BERT-P, RACE/Acc, SQuADv2/EM, MemoTrap/Acc, IFEval/Acc, FaithDial/Acc, HaluQA/Acc, HaluSumm/Acc, HaluDial/Acc, FEVER/Acc, TrueFalse/Acc, PopQA/EM, NQ-Swap/EM

Latest Results

Rows are merged from public per-task result JSON shards for selected high-visibility model organizations. The importer reads only aggregate result prefixes, preserves source model paths, and excludes rows with fewer than eight parsed task metrics.
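The merge-and-filter step described above can be sketched roughly as follows. This is an illustration only, not the importer's actual code: the record layout (`model`, `metric`, `value`), the function names, and the idea that Average Task Score is an unweighted mean of the parsed metrics are all assumptions, since the page does not specify them. Only the eight-metric threshold and the preservation of source model paths come from the description.

```python
from collections import defaultdict
from statistics import mean

# Threshold taken from the description: rows with fewer than eight
# parsed task metrics are not kept.
MIN_PARSED_METRICS = 8


def merge_shards(shards):
    """Group per-task records by source model path, leaving the path unchanged.

    Hypothetical record layout: {"model": str, "metric": str, "value": float}.
    """
    merged = defaultdict(dict)
    for record in shards:
        merged[record["model"]][record["metric"]] = record["value"]
    return merged


def build_rows(shards, min_metrics=MIN_PARSED_METRICS):
    """Merge shards into ranked rows, dropping models with too few metrics."""
    rows = []
    for model, metrics in merge_shards(shards).items():
        if len(metrics) < min_metrics:
            continue  # fewer than eight parsed task metrics: skip this row
        rows.append({
            "subject": model,
            # Assumption: coverage is the count of parsed metrics.
            "parsed_task_coverage": len(metrics),
            # Assumption: the average is an unweighted mean of parsed metrics.
            "average_task_score": round(mean(metrics.values()), 2),
        })
    # Rank by descending average score, starting at 1.
    rows.sort(key=lambda row: row["average_task_score"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows
```

Under these assumptions, a model with only seven parsed metrics never reaches the table, no matter how high its individual scores are.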

Rank | Subject | Average Task Score | Model Match | Provenance | Sampled
1 | upstage/llama-30b-instruct-2048 | 54.14 | | Imported | 2026-05-06
2 | HuggingFaceH4/zephyr-7b-alpha | 52.08 | | Imported | 2026-05-06
3 | mistralai/Mistral-7B-Instruct-v0.2 | 51.63 | | Imported | 2026-05-06
4 | stabilityai/StableBeluga-13B | 51.17 | | Imported | 2026-05-06
5 | mistralai/Mistral-7B-Instruct-v0.1 | 50.48 | Mistral: Mistral 7B Instruct v0.1 (mistralai-mistral-7b-instruct-v0.1) | Imported | 2026-05-06
6 | NousResearch/Llama-2-13b-hf | 49.78 | | Imported | 2026-05-06
7 | HuggingFaceH4/zephyr-7b-beta | 49.62 | | Imported | 2026-05-06
8 | h2oai/h2ogpt-4096-llama2-7b-chat | 47.81 | | Imported | 2026-05-06
9 | NousResearch/Nous-Hermes-Llama2-13b | 47.05 | | Imported | 2026-05-06
10 | HuggingFaceH4/mistral-7b-sft-beta | 46.80 | | Imported | 2026-05-06
11 | NousResearch/Nous-Hermes-llama-2-7b | 46.65 | | Imported | 2026-05-06
12 | NousResearch/Llama-2-7b-chat-hf | 46.54 | | Imported | 2026-05-06
13 | meta-llama/Llama-2-13b-hf | 46.32 | | Imported | 2026-05-06
14 | NousResearch/Yarn-Mistral-7b-128k | 46.29 | | Imported | 2026-05-06
15 | h2oai/h2ogpt-4096-llama2-13b-chat | 46.05 | | Imported | 2026-05-06
16 | stabilityai/StableBeluga-7B | 46.02 | | Imported | 2026-05-06
17 | meta-llama/Llama-2-13b-chat-hf | 45.52 | | Imported | 2026-05-06
18 | google/gemma-7b | 44.42 | | Imported | 2026-05-06
19 | microsoft/Orca-2-13b | 44.31 | | Imported | 2026-05-06
20 | meta-llama/Llama-2-7b-hf | 44.23 | | Imported | 2026-05-06
21 | mistralai/Mistral-7B-v0.1 | 44.06 | | Imported | 2026-05-06
22 | upstage/SOLAR-10.7B-Instruct-v1.0 | 43.71 | | Imported | 2026-05-06
23 | meta-llama/Llama-2-7b-chat-hf | 43.03 | | Imported | 2026-05-06
24 | EleutherAI/llemma_7b | 41.90 | | Imported | 2026-05-06
25 | NousResearch/Llama-2-7b-hf | 41.03 | | Imported | 2026-05-06
26 | tiiuae/falcon-7b-instruct | 40.57 | | Imported | 2026-05-06
27 | tiiuae/falcon-rw-1b | 40.26 | | Imported | 2026-05-06
28 | bigscience/bloomz-3b | 39.26 | | Imported | 2026-05-06
29 | upstage/SOLAR-10.7B-v1.0 | 39.20 | | Imported | 2026-05-06
30 | google/gemma-2b | 39.09 | | Imported | 2026-05-06
31 | bigscience/bloomz-7b1 | 38.75 | | Imported | 2026-05-06
32 | bigscience/bloom-1b7 | 38.24 | | Imported | 2026-05-06
33 | EleutherAI/gpt-neo-1.3B | 38.20 | | Imported | 2026-05-06
34 | bigscience/bloom-3b | 37.51 | | Imported | 2026-05-06
35 | bigscience/bloom-560m | 37.43 | | Imported | 2026-05-06
36 | tiiuae/falcon-7b | 36.97 | | Imported | 2026-05-06
37 | bigscience/bloomz-560m | 36.76 | | Imported | 2026-05-06
38 | bigscience/bloom-1b1 | 36.47 | | Imported | 2026-05-06
39 | EleutherAI/gpt-neo-125m | 36.12 | | Imported | 2026-05-06
40 | EleutherAI/gpt-neo-2.7B | 35.82 | | Imported | 2026-05-06
41 | EleutherAI/gpt-j-6b | 35.30 | | Imported | 2026-05-06
42 | bigscience/bloom-7b1 | 34.25 | | Imported | 2026-05-06