BFCL_v3_MultiTurn

The Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. It requires models to manage sequential function calls, carry conversational context across multiple turns, and decide dynamically when and how to use the available functions. BFCL V3 uses state-based evaluation: it verifies the actual state of the API systems after function execution, which gives a more realistic assessment of function calling capabilities in agentic applications.
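To make the idea of state-based evaluation concrete, here is a minimal Python sketch. It is not the BFCL harness: the `ToyFileSystem` class, the `run_turns` and `state_matches` helpers, and the call format are all hypothetical names introduced for illustration, and the sketch assumes only the final state is compared. The point it shows is that two different call sequences can both pass if they leave the API in the same state.

```python
class ToyFileSystem:
    """Hypothetical stateful API, used only for illustration."""

    def __init__(self):
        self.files = {}

    def create_file(self, name):
        self.files[name] = ""

    def write_file(self, name, text):
        self.files[name] = text


def run_turns(api, turns):
    """Apply each turn's (function_name, kwargs) calls to the API, in order."""
    for calls in turns:
        for name, kwargs in calls:
            getattr(api, name)(**kwargs)
    return api


def state_matches(model_turns, reference_turns):
    """Compare the resulting API state, not the exact call sequence."""
    model_api = run_turns(ToyFileSystem(), model_turns)
    reference_api = run_turns(ToyFileSystem(), reference_turns)
    return vars(model_api) == vars(reference_api)


# Reference trajectory: create the file in turn 1, write to it in turn 2.
reference = [
    [("create_file", {"name": "a.txt"})],
    [("write_file", {"name": "a.txt", "text": "hi"})],
]
# Different calls, same end state.
model_ok = [[("write_file", {"name": "a.txt", "text": "hi"})], []]
# The file exists but was never written to.
model_bad = [[("create_file", {"name": "a.txt"})], []]

print(state_matches(model_ok, reference))
print(state_matches(model_bad, reference))
```

Comparing states instead of call lists means a model is not penalized for taking a different but equivalent route, while a model that stops short of the required end state still fails.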

Rows: 2
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | MiniMax M2.5 | 0.77 | MiniMax M2.5 (minimax-minimax-m2.5) | Self-reported | 2026-05-06
2 | Nemotron Nano 9B v2 | 0.67 | Nemotron Nano 9B V2 (nvidia-nemotron-nano-9b-v2) | Self-reported | 2026-05-06