VoiceAgentBench

A spoken tool-use benchmark for speech-input agents, covering tool selection, parameter filling, orchestration, multi-turn handling, and safety checks.

6 rows
english_parameter_filling_average (primary metric)
2026-05-27 (sampled)

Metadata

Metrics

English PF average, English TS average, English TCS average, Multilingual PF average, Multilingual TS average, Multilingual TCS average, English refusal rate, Multilingual refusal rate

Latest Results

Rows are transcribed from Tables 2, 3, and 4 of the public VoiceAgentBench arXiv paper. The primary score is the English parameter-filling (PF) average.

Rank  Subject               English PF average  Model Match  Provenance  Sampled
1     Whisperv3-Llama3 70B  60.64%              -            Imported    2026-05-27
2     Whisperv3-Gemma3 27B  59.28%              -            Imported    2026-05-27
3     KimiAudio 7B          57.57%              -            Imported    2026-05-27
4     Whisperv3-Qwen3 8B    56.26%              -            Imported    2026-05-27
5     AudioFlamingo3 7B     19.71%              -            Imported    2026-05-27
6     Qwen2.5-Omni 7B       1.7%                -            Imported    2026-05-27
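
The ranking above appears to follow the primary metric in descending order. A minimal Python sketch of that ordering is given here, assuming rank is simply the position after sorting by the English PF average. The function name `rank_by_primary` and the tuple layout are hypothetical and are not part of the benchmark's tooling. The scores are copied from the rows above.

```python
# Rows copied from the results listing: (subject, English PF average in percent).
# Deliberately listed out of order to show that sorting recovers the ranking.
rows = [
    ("Whisperv3-Qwen3 8B", 56.26),
    ("Qwen2.5-Omni 7B", 1.7),
    ("Whisperv3-Llama3 70B", 60.64),
    ("AudioFlamingo3 7B", 19.71),
    ("KimiAudio 7B", 57.57),
    ("Whisperv3-Gemma3 27B", 59.28),
]


def rank_by_primary(rows):
    """Return (rank, subject, score) tuples, highest score first.

    Assumption: rank is the 1-based position after a descending sort
    on the primary metric; ties are not handled specially.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_by_primary(rows)
for rank, subject, score in ranked:
    print(f"{rank}  {subject:<22}{score:.2f}%")
```

Sorting on a single numeric key keeps the sketch independent of the other metrics (TS, TCS, refusal rate), which the listing reports but does not use for ranking.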