AudioBench

Audio-language benchmark covering speech, sound, and music, with ASR, QA, translation, and audio reasoning tasks scored on a range of task-level metrics.

Rows: 13
Primary metric: aggregate_non_wer_score
Sampled: 2026-05-27

Metadata

Metrics

Aggregate Non-WER Score, Average Llama 3 70B Judge, Average GPT-4o Judge, Average String Match, Average BLEU, Average METEOR, Average WER (lower is better)

Latest Results

Rows are aggregated by model from the public AudioBench leaderboard JSON across 91 task entries. Non-WER values on a 0-1 scale are normalized to percentages before aggregation.
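
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual pipeline: the entry fields (`model`, `metric`, `value`, `scale_max`), the metric label `"wer"`, and the use of an unweighted mean are all assumptions made for the example, since the source does not specify the JSON schema or the averaging rule. The sample values are invented.

```python
from collections import defaultdict
from statistics import mean

WER_METRIC = "wer"  # hypothetical label for WER entries


def to_percent(value, scale_max):
    """Rescale a score to 0-100 given the maximum of its native scale."""
    return value * 100.0 / scale_max


def aggregate_non_wer(entries):
    """Average the non-WER task scores per model after normalizing to percentages.

    Assumes each entry is a dict with hypothetical keys:
    model, metric, value, scale_max (1 for 0-1 scores, 100 for percentages).
    """
    per_model = defaultdict(list)
    for entry in entries:
        if entry["metric"] == WER_METRIC:
            continue  # WER is lower-is-better, so it is kept out of this aggregate
        per_model[entry["model"]].append(
            to_percent(entry["value"], entry["scale_max"])
        )
    # Unweighted mean per model (an assumption about the aggregation rule)
    return {model: round(mean(scores), 4) for model, scores in per_model.items()}


def rank(aggregates):
    """Order models by descending aggregate score."""
    return sorted(aggregates.items(), key=lambda item: item[1], reverse=True)


# Invented sample entries, for illustration only
sample_entries = [
    {"model": "model_a", "metric": "llama3_70b_judge", "value": 0.5, "scale_max": 1},
    {"model": "model_a", "metric": "bleu", "value": 60.0, "scale_max": 100},
    {"model": "model_a", "metric": "wer", "value": 0.2, "scale_max": 1},
    {"model": "model_b", "metric": "string_match", "value": 0.25, "scale_max": 1},
]

aggregates = aggregate_non_wer(sample_entries)
ranking = rank(aggregates)
```

Carrying an explicit `scale_max` per entry avoids guessing the scale from the value itself, which would misclassify a genuine percentage score at or below 1.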

| Rank | Subject | Aggregate Non-WER Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4o-audio | 57.1219 | GPT-4o Audio (openai-gpt-4o-audio-preview) | Imported | 2026-05-27 |
| 2 | gemini-1.5-flash | 47.951 | | Imported | 2026-05-27 |
| 3 | MERaLiON-AudioLLM-Whisper-SEA-LION | 47.8887 | | Imported | 2026-05-27 |
| 4 | phi_4_multimodal_instruct | 37.3843 | | Imported | 2026-05-27 |
| 5 | seallms_audio_7b | 36.7351 | | Imported | 2026-05-27 |
| 6 | Qwen2-Audio-7B-Instruct | 36.6041 | | Imported | 2026-05-27 |
| 7 | cascade_whisper_large_v3_llama_3_8b_instruct | 36.3648 | | Imported | 2026-05-27 |
| 8 | cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct | 32.4654 | | Imported | 2026-05-27 |
| 9 | Qwen-Audio-Chat | 30.5138 | | Imported | 2026-05-27 |
| 10 | Marco-LLM-ST | 30.0295 | | Imported | 2026-05-27 |
| 11 | SALMONN_7B | 29.4734 | | Imported | 2026-05-27 |
| 12 | WavLLM_fairseq | 20.4118 | | Imported | 2026-05-27 |
| 13 | whisper_large_v3 | 13.8762 | | Imported | 2026-05-27 |