Halluverse-M3

Multitask, multilingual hallucination-detection benchmark spanning QA and dialogue summarization in English, Arabic, Hindi, and Turkish.

Rows: 14
Primary metric: macro_accuracy
Sampled: 2026-05-28

Metadata

Metrics

Macro Accuracy, QA English Accuracy, QA Arabic Accuracy, QA Hindi Accuracy, QA Turkish Accuracy, Summarization English Accuracy, Summarization Arabic Accuracy, Summarization Hindi Accuracy, Summarization Turkish Accuracy
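The page does not define how Macro Accuracy is derived from the eight per-task, per-language accuracies. A minimal sketch, assuming it is the unweighted mean over the eight slices; the function name, the dictionary layout, and the sample values are all hypothetical placeholders, not benchmark results.

```python
from statistics import mean

# Hypothetical per-slice accuracies keyed by (task, language).
# These are illustrative placeholders, NOT values from the benchmark.
slice_accuracy = {
    ("qa", "en"): 0.80,
    ("qa", "ar"): 0.70,
    ("qa", "hi"): 0.65,
    ("qa", "tr"): 0.75,
    ("summarization", "en"): 0.85,
    ("summarization", "ar"): 0.60,
    ("summarization", "hi"): 0.55,
    ("summarization", "tr"): 0.70,
}


def macro_accuracy(scores):
    """Return the unweighted mean of per-slice accuracies.

    Assumption: every slice counts equally regardless of how many
    examples it contains. The benchmark may define the metric differently.
    """
    if not scores:
        raise ValueError("at least one slice accuracy is required")
    return mean(scores.values())


if __name__ == "__main__":
    print(f"macro accuracy: {macro_accuracy(slice_accuracy):.2%}")
```

Under this assumption a model that is strong in English but weak in Hindi is penalized as much as the reverse, since no slice is weighted by its example count.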

Latest Results

Rows are imported from the publicly available LaTeX source on arXiv. The source table reports hallucination-detection accuracy across QA and dialogue summarization in four languages.

Rank | Subject | Macro Accuracy | Model Match | Provenance | Sampled
1 | GPT-4o | 80.30% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28
2 | GPT-4.1 | 78.66% | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-28
3 | Claude-3.5 | 76.98% | | Imported | 2026-05-28
4 | Gemini-2.5 | 76.26% | | Imported | 2026-05-28
5 | DeepSeek-V2.5 | 73.76% | | Imported | 2026-05-28
6 | GPT-4o mini | 73.39% | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-28
7 | LLaMA-3.3 (70B) | 70.53% | | Imported | 2026-05-28
8 | Phi-4 (14B) | 70.38% | Phi 4 (microsoft-phi-4) | Imported | 2026-05-28
9 | PaLM 2 | 69.98% | | Imported | 2026-05-28
10 | Qwen-2.5-72B | 69.81% | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-28
11 | Qwen-2.5-32B | 66.84% | | Imported | 2026-05-28
12 | Gemma-2 (27B) | 62.10% | | Imported | 2026-05-28
13 | Qwen-2.5-7B | 59.40% | | Imported | 2026-05-28
14 | Mistral-7B | 56.06% | | Imported | 2026-05-28