Halluverse-M3
Multitask multilingual hallucination detection benchmark spanning QA and dialogue summarization in English, Arabic, Hindi, and Turkish.
14 rows
Primary metric: macro_accuracy
Sampled: 2026-05-28
Metadata
Metrics
Macro Accuracy, QA English Accuracy, QA Arabic Accuracy, QA Hindi Accuracy, QA Turkish Accuracy, Summarization English Accuracy, Summarization Arabic Accuracy, Summarization Hindi Accuracy, Summarization Turkish Accuracy
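The page does not define how Macro Accuracy is derived from the eight per-task, per-language accuracies listed above. A minimal sketch, assuming it is the unweighted mean of those eight values, might look like the following. The function name `macro_accuracy`, the `(task, language)` keys, and the sample numbers are all hypothetical placeholders, not benchmark results.

```python
from statistics import mean

# Hypothetical per-slice accuracies keyed by (task, language).
# These are placeholder values for illustration, NOT benchmark results.
slice_accuracy = {
    ("qa", "en"): 0.80,
    ("qa", "ar"): 0.70,
    ("qa", "hi"): 0.60,
    ("qa", "tr"): 0.70,
    ("summarization", "en"): 0.80,
    ("summarization", "ar"): 0.70,
    ("summarization", "hi"): 0.60,
    ("summarization", "tr"): 0.70,
}


def macro_accuracy(scores):
    """Return the unweighted mean of per-slice accuracies.

    Assumption: every (task, language) slice counts equally, regardless
    of how many examples it contains. The benchmark may define it differently.
    """
    if not scores:
        raise ValueError("no slice scores given")
    return mean(scores.values())


if __name__ == "__main__":
    print(f"macro accuracy: {macro_accuracy(slice_accuracy):.2%}")
```

Under this assumption a model that is strong in English but weak in the other languages is penalized more than it would be under an example-weighted average, which is the usual reason for reporting a macro score on a multilingual benchmark.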
| Rank | Subject | Macro Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o | 80.30% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-28 |
| 2 | GPT-4.1 | 78.66% | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-28 |
| 3 | Claude-3.5 | 76.98% | — | Imported | 2026-05-28 |
| 4 | Gemini-2.5 | 76.26% | — | Imported | 2026-05-28 |
| 5 | DeepSeek-V2.5 | 73.76% | — | Imported | 2026-05-28 |
| 6 | GPT-4o mini | 73.39% | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-28 |
| 7 | LLaMA-3.3 (70B) | 70.53% | — | Imported | 2026-05-28 |
| 8 | Phi-4 (14B) | 70.38% | Phi 4 (`microsoft-phi-4`) | Imported | 2026-05-28 |
| 9 | PaLM 2 | 69.98% | — | Imported | 2026-05-28 |
| 10 | Qwen-2.5-72B | 69.81% | Qwen2.5 72B Instruct (`qwen-qwen-2.5-72b-instruct`) | Imported | 2026-05-28 |
| 11 | Qwen-2.5-32B | 66.84% | — | Imported | 2026-05-28 |
| 12 | Gemma-2 (27B) | 62.10% | — | Imported | 2026-05-28 |
| 13 | Qwen-2.5-7B | 59.40% | — | Imported | 2026-05-28 |
| 14 | Mistral-7B | 56.06% | — | Imported | 2026-05-28 |