AIR-Bench

Audio instruction benchmark evaluating large audio-language models on generative comprehension across chat and foundation audio tasks.

Rows: 18
Primary metric: average
Sampled: 2026-05-27

Metadata

Metrics

Chat Average, Foundation Average, Speech, Sound, Music, Mixed Audio

Latest Results

Rows are parsed from the chat and foundation leaderboard tables in the public AIR-Bench README. For foundation rows, the primary score is the mean of the finite task percentages.
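The "mean of finite task percentages" rule can be illustrated with a minimal sketch. The function name `mean_of_finite` and the sample values are hypothetical, chosen only for illustration; the page does not show how the score is actually computed, so this is one plausible reading under the assumption that missing or non-numeric task cells are skipped rather than counted as zero.

```python
import math
from typing import Iterable, Optional


def mean_of_finite(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average only the finite entries; return None when nothing is usable.

    Assumption: missing cells arrive as None, NaN, or +/-inf and are ignored.
    """
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        return None
    return sum(finite) / len(finite)


# Hypothetical task percentages for one model (not taken from the leaderboard).
# NaN and None stand in for tasks without a reported score.
example_scores = [60.0, float("nan"), 50.0, None, 40.0]
print(mean_of_finite(example_scores))  # 50.0
```

Skipping non-finite entries means a model is averaged only over the tasks it reports, so two foundation rows may be averaged over different numbers of tasks.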

| Rank | Subject | Primary Score | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen-Audio-Turbo (foundation) | 58.285 | | Imported | 2026-05-27 |
| 2 | Qwen-Audio (foundation) | 54.595 | | Imported | 2026-05-27 |
| 3 | Whisper+GPT 4 (foundation) | 53.5889 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 4 | Pandagpt (foundation) | 39.72 | | Imported | 2026-05-27 |
| 5 | SALMONN (foundation) | 36.53 | | Imported | 2026-05-27 |
| 6 | BLSP (foundation) | 32.16 | | Imported | 2026-05-27 |
| 7 | Next-gpt (foundation) | 31.765 | | Imported | 2026-05-27 |
| 8 | SpeechGPT (foundation) | 30.885 | | Imported | 2026-05-27 |
| 9 | Qwen2-Audio (chat) | 6.93 | | Imported | 2026-05-27 |
| 10 | Qwen-Audio-Turbo (chat) | 6.34 | | Imported | 2026-05-27 |
| 11 | SALMONN (chat) | 6.11 | | Imported | 2026-05-27 |
| 12 | Qwen-Audio (chat) | 6.08 | | Imported | 2026-05-27 |
| 13 | Gemini-1.5-pro (chat) | 5.7 | | Imported | 2026-05-27 |
| 14 | BLSP (chat) | 5.33 | | Imported | 2026-05-27 |
| 15 | Pandagpt (chat) | 4.25 | | Imported | 2026-05-27 |
| 16 | Next-gpt (chat) | 4.13 | | Imported | 2026-05-27 |
| 17 | SpeechGPT (chat) | 1.15 | | Imported | 2026-05-27 |
| 18 | Macaw-LLM (chat) | 1.01 | | Imported | 2026-05-27 |