ComplexFuncBench

ComplexFuncBench is a benchmark that evaluates how well large language models handle complex function-calling scenarios. It covers multi-step and constrained function calling that requires long-parameter filling, parameter value reasoning, and contexts of up to 128k tokens. The benchmark includes 1,000 samples across five real-world scenarios.
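To make "multi-step" calling and "parameter value reasoning" concrete, here is a minimal sketch of the kind of task the description refers to. The tool names (`search_hotels`, `book_room`), their parameters, and the returned fields are hypothetical and are not taken from the benchmark; the sketch only illustrates a second call that depends on the first call's result, and a parameter value that has to be derived rather than copied from the request.

```python
from datetime import date, timedelta

# Hypothetical tools: names, parameters, and return fields are illustrative
# assumptions, not the benchmark's actual API.
def search_hotels(city: str, checkin: str, checkout: str) -> list[dict]:
    """Stub search tool returning one fixed result."""
    return [{"hotel_id": "h-001", "name": "Example Inn", "city": city}]

def book_room(hotel_id: str, checkin: str, checkout: str, guests: int) -> dict:
    """Stub booking tool that echoes the reservation details."""
    nights = (date.fromisoformat(checkout) - date.fromisoformat(checkin)).days
    return {"status": "confirmed", "hotel_id": hotel_id,
            "nights": nights, "guests": guests}

def run_task(city: str, start: date, nights: int, guests: int) -> dict:
    # Parameter value reasoning: the request gives a start date and a number
    # of nights, so the checkout date must be derived.
    checkin = start.isoformat()
    checkout = (start + timedelta(days=nights)).isoformat()

    # Step 1: search for candidate hotels.
    hotels = search_hotels(city, checkin, checkout)

    # Step 2: the booking call depends on the hotel_id returned by step 1.
    return book_room(hotels[0]["hotel_id"], checkin, checkout, guests)

result = run_task("Paris", date(2025, 1, 10), nights=3, guests=2)
print(result)
```

A model under evaluation would have to produce both calls in the right order with correctly filled arguments; the stubs here simply stand in for real tool backends.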

Rows: 6
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject       Score  Model Match          Model ID                  Provenance     Sampled
1     GPT-4o        0.67   GPT-4o (2024-08-06)  openai-gpt-4o-2024-08-06  Self-reported  2026-05-06
2     GPT-4.1       0.66   GPT-4.1              openai-gpt-4.1            Self-reported  2026-05-06
3     GPT-4.5       0.63   GPT-4.5              openai-gpt-4.5-preview    Self-reported  2026-05-06
4     GPT-4.1 mini  0.49   GPT-4.1 Mini         openai-gpt-4.1-mini       Self-reported  2026-05-06
5     o3-mini       0.18   o3-mini              openai-o3-mini            Self-reported  2026-05-06
6     GPT-4.1 nano  0.06   GPT-4.1 Nano         openai-gpt-4.1-nano       Self-reported  2026-05-06