BFCL_v3_MultiTurn

The Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. It introduces complex interactions that require models to manage sequential function calls, carry conversational context across multiple turns, and decide dynamically when and how to use the available functions. BFCL V3 uses state-based evaluation: it verifies the actual state of the API systems after the function calls have executed, which gives a more realistic assessment of function calling capabilities in agentic applications.
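The idea behind state-based evaluation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual harness: `ToyFileSystem`, `run_calls`, and `state_match` are hypothetical names invented here, and the real BFCL V3 checker may differ in how it executes calls and compares states. The sketch replays a model's calls and a reference sequence against fresh copies of a stateful API, then compares the resulting states rather than the call sequences themselves.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stateful API, used only for illustration."""
    files: dict = field(default_factory=dict)

    def create(self, name: str) -> None:
        self.files[name] = ""

    def write(self, name: str, text: str) -> None:
        self.files[name] = text

    def delete(self, name: str) -> None:
        self.files.pop(name, None)


def run_calls(calls):
    """Execute (method_name, kwargs) pairs on a fresh instance and return its final state."""
    api = ToyFileSystem()
    for method_name, kwargs in calls:
        getattr(api, method_name)(**kwargs)
    return api.files


def state_match(model_calls, reference_calls):
    """Return True when both call sequences leave the API in the same state."""
    return run_calls(model_calls) == run_calls(reference_calls)


# Reference trajectory: create the file, then write to it.
reference = [
    ("create", {"name": "a.txt"}),
    ("write", {"name": "a.txt", "text": "hi"}),
]
# A different call sequence that reaches the same end state.
model_ok = [("write", {"name": "a.txt", "text": "hi"})]
# A sequence that stops early and leaves the file empty.
model_bad = [("create", {"name": "a.txt"})]

print(state_match(model_ok, reference))
print(state_match(model_bad, reference))
```

In this toy setup, a model that reaches the correct end state through a different sequence of calls still passes, while a model that stops early fails. That is the practical difference between comparing states and comparing call sequences.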

Rows: 2

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: bfcl_v3_multiturn
Category: Tool Use
Release: Unknown
Source: Source page
Snapshot: Snapshot source

Metrics

Score, Normalized Score

Latest Results

Rank	Subject	Score	Model Match	Provenance	Sampled
1	MiniMax M2.5	0.77	MiniMax M2.5 (minimax-minimax-m2.5)	Self-reported	2026-05-06
2	Nemotron Nano 9B v2	0.67	Nemotron Nano 9B V2 (nvidia-nemotron-nano-9b-v2)	Self-reported	2026-05-06
