BFCL v2

The Berkeley Function Calling Leaderboard (BFCL) v2 is a benchmark for evaluating the function-calling capabilities of large language models. It comprises 2,251 question-function-answer pairs built on enterprise and OSS-contributed functions, and it addresses data contamination and bias by drawing on live, user-contributed scenarios. The benchmark reports AST accuracy, executable accuracy, irrelevance detection, and relevance detection across Python, Java, and JavaScript, and it includes complex real-world function-calling scenarios with multilingual prompts.
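To make "AST accuracy" concrete, here is a minimal sketch of an AST-style check for a Python call. It is not BFCL's actual checker. The helper names (parse_call, ast_match), the example function get_weather, and the "allowed values per parameter" format are assumptions made for illustration, and the sketch handles keyword arguments with literal values only.

```python
import ast


def parse_call(source: str):
    """Parse one Python call expression into (function name, keyword args).

    Hypothetical helper: only keyword arguments with literal values are supported.
    """
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    name = ast.unparse(node.func)  # requires Python 3.9+
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call names the expected function and each argument
    takes one of the allowed values (assumed format: parameter -> list of values).
    """
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        # Unparseable output or unsupported call shape counts as a miss.
        return False
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[param] in allowed[param] for param in allowed)


# Hypothetical reference answer for a made-up get_weather function.
expected = {"city": ["Berlin"], "unit": ["C", "celsius"]}
print(ast_match('get_weather(city="Berlin", unit="C")', "get_weather", expected))
print(ast_match('get_weather(city="Paris", unit="C")', "get_weather", expected))
```

The comparison works on the parsed structure of the call rather than on the raw string, so differences in whitespace or argument order do not affect the result. Executable accuracy, by contrast, would involve actually running the call, which this sketch does not attempt.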

Rows: 5

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: bfcl_v2
Category: Tool Use
Release: Unknown

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Llama 3.3 70B Instruct	0.77	Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct)	Self-reported	2026-05-06
2	Llama 3.1 Nemotron Ultra 253B v1	0.74	—	Self-reported	2026-05-06
3	Llama-3.3 Nemotron Super 49B v1	0.74	—	Self-reported	2026-05-06
4	Llama 3.2 3B Instruct	0.67	Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct)	Self-reported	2026-05-06
5	Llama 3.1 Nemotron Nano 8B V1	0.64	—	Self-reported	2026-05-06
