WildAgtEval

Agent API-use benchmark measuring robustness to real-world API complexity across multi-turn conversations, API functions, and injected complexity scenarios.

Rows: 10
Primary metric: complexity_injected_api_call_accuracy
Sampled: 2026-05-28

Metadata

Metrics

Complexity-Injected API Call Accuracy, Complexity-Absent API Call Accuracy, Average Degradation (lower is better)
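The relationship between the three metrics can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes degradation is the simple percentage-point difference between complexity-absent and complexity-injected accuracy, averaged over scenarios. The benchmark's own definition may differ (for example, a relative drop). The function names and the accuracy pairs are hypothetical and are not taken from the results on this page.

```python
from statistics import mean


def degradation(absent_acc: float, injected_acc: float) -> float:
    """Percentage-point drop from complexity-absent to complexity-injected accuracy.

    Assumption: degradation is an absolute difference, so lower is better.
    """
    return absent_acc - injected_acc


def average_degradation(pairs: list[tuple[float, float]]) -> float:
    """Mean degradation over (absent, injected) accuracy pairs."""
    return mean(degradation(absent, injected) for absent, injected in pairs)


# Hypothetical (complexity-absent, complexity-injected) accuracies in percent,
# one pair per complexity scenario. These are made-up numbers for illustration.
hypothetical_pairs = [(80.0, 70.0), (75.0, 60.0), (90.0, 85.0)]

avg = average_degradation(hypothetical_pairs)
print(f"Average degradation: {avg:.1f} percentage points")
```

Under this assumed definition, a model whose accuracy barely changes when complexity is injected scores close to zero, which is why the metric is marked as lower is better.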

Latest Results

Rows are imported from the LaTeX source of a public arXiv paper. The source table reports average API-call accuracy both with and without injected API complexity.

Rank | Subject | Complexity-Injected API Call Accuracy | Model Match | Provenance | Sampled
1 | Claude-4.0-Sonnet (Think) | 67.5% | | Imported | 2026-05-28
2 | Claude-4.0-Sonnet | 63.6% | | Imported | 2026-05-28
3 | Qwen3-235B-Thinking | 62.6% | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Imported | 2026-05-28
4 | GPT-OSS-120B | 62.5% | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28
5 | Claude-3.7-Sonnet | 61.6% | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-28
6 | Qwen3-235B-Instruct | 58.7% | | Imported | 2026-05-28
7 | Claude-3.5-Sonnet | 55.8% | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-28
8 | Qwen3-32B | 49.9% | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-28
9 | Mistral-24B-Inst | 47.6% | | Imported | 2026-05-28
10 | DeepSeek-R1-Qwen32B | 26% | | Imported | 2026-05-28