ACEBench

ACEBench is a comprehensive benchmark for evaluating Large Language Models' tool usage capabilities across three primary evaluation types: Normal (basic tool usage scenarios), Special (tool usage with ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs across 8 major domains and 68 sub-domains including technology, finance, entertainment, society, health, culture, and environment, supporting both English and Chinese languages.

2rows

scoreprimary metric

2026-05-06sampled

Metadata

ID: acebench
Category: Tool Use
Release: 2025-01-22
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Kimi K2 Instruct	0.77	KIMI MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2	Self-reported	2026-05-06
1	Kimi K2-Instruct-0905	0.77	KIMI MoonshotAI: Kimi K2 0905 moonshotai-kimi-k2-0905	Self-reported	2026-05-06

Metadata

Metrics

Latest Results