AHa-Bench

Audio hallucination benchmark for large audio-language models across semantic, acoustic, and confusion hallucination types.

Rows: 9
Primary metric: hallucination_accuracy
Sampled: 2026-05-28

Metadata

Metrics

Hallucination Accuracy (higher is better), Hallucination Error Rate (lower is better)

Latest Results

Rows are imported from the mean accuracy column of Table 2 in the public AHa-Bench paper. Qualitative examples from the project page are not used as aggregate scores.

Rank  Subject           Hallucination Accuracy  Model Match                             Provenance  Sampled
1     Gemini-2.5-Pro    60%                     Gemini 2.5 Pro (google-gemini-2.5-pro)  Imported    2026-05-28
2     GPT-Audio         28.75%                  GPT Audio (openai-gpt-audio)            Imported    2026-05-28
3     Kimi-Audio        23.94%                  -                                       Imported    2026-05-28
4     Qwen-Audio        22.46%                  -                                       Imported    2026-05-28
5     Qwen2-Audio-Inst  20.73%                  -                                       Imported    2026-05-28
6     FunAudioLLM       20.54%                  -                                       Imported    2026-05-28
7     GLM4-Voice        16.42%                  -                                       Imported    2026-05-28
8     Qwen2-Audio       16.15%                  -                                       Imported    2026-05-28
9     SALMONN           7.76%                   -                                       Imported    2026-05-28
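
For readers who want to work with these rows programmatically, the following is a minimal sketch that re-derives the ranking from the accuracy values listed above. The names `ROWS` and `rank_rows` are hypothetical, not part of any AHa-Bench tooling. The error-rate column is computed as 100 minus accuracy, which is an assumption: this page lists Hallucination Error Rate as a metric but does not define it.

```python
# Rows copied from the results above: (subject, hallucination accuracy in percent).
ROWS = [
    ("Gemini-2.5-Pro", 60.0),
    ("GPT-Audio", 28.75),
    ("Kimi-Audio", 23.94),
    ("Qwen-Audio", 22.46),
    ("Qwen2-Audio-Inst", 20.73),
    ("FunAudioLLM", 20.54),
    ("GLM4-Voice", 16.42),
    ("Qwen2-Audio", 16.15),
    ("SALMONN", 7.76),
]


def rank_rows(rows):
    """Return (rank, subject, accuracy, error_rate) tuples, best accuracy first.

    ASSUMPTION: error rate is taken to be the complement of accuracy
    (100 - accuracy). The source page does not state this definition.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, accuracy, round(100.0 - accuracy, 2))
        for rank, (subject, accuracy) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, subject, accuracy, error_rate in rank_rows(ROWS):
        print(f"{rank:>2}  {subject:<18} {accuracy:6.2f}%  {error_rate:6.2f}%")
```

Sorting by accuracy in descending order reproduces the rank column as published, which gives a quick consistency check if the rows are re-imported later.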