WildAgtEval
Agent API-use benchmark measuring robustness to real-world API complexity across multi-turn conversations, API functions, and injected complexity scenarios.
10 rows
Primary metric: complexity_injected_api_call_accuracy
Sampled: 2026-05-28
Metadata
Metrics: Complexity-Injected API Call Accuracy, Complexity-Absent API Call Accuracy, Average Degradation (lower is better)
| Rank | Subject | Complexity-Injected API Call Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude-4.0-Sonnet (Think) | 67.5% | — | Imported | 2026-05-28 |
| 2 | Claude-4.0-Sonnet | 63.6% | — | Imported | 2026-05-28 |
| 3 | Qwen3-235B-Thinking | 62.6% | Qwen3 235B A22B Thinking 2507 (`qwen-qwen3-235b-a22b-thinking-2507`) | Imported | 2026-05-28 |
| 4 | GPT-OSS-120B | 62.5% | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-28 |
| 5 | Claude-3.7-Sonnet | 61.6% | Claude 3.7 Sonnet (`anthropic-claude-3.7-sonnet`) | Imported | 2026-05-28 |
| 6 | Qwen3-235B-Instruct | 58.7% | — | Imported | 2026-05-28 |
| 7 | Claude-3.5-Sonnet | 55.8% | Claude 3.5 Sonnet (`anthropic-claude-3.5-sonnet`) | Imported | 2026-05-28 |
| 8 | Qwen3-32B | 49.9% | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-28 |
| 9 | Mistral-24B-Inst | 47.6% | — | Imported | 2026-05-28 |
| 10 | DeepSeek-R1-Qwen32B | 26% | — | Imported | 2026-05-28 |
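
The Average Degradation metric is listed above but not defined on this page. As a rough sketch only, assuming it is the mean drop (in percentage points) from complexity-absent accuracy to complexity-injected accuracy across complexity scenarios, it could be computed as below. The function name, the scenario keys, and the numbers are hypothetical placeholders, not values from this leaderboard, and the benchmark's own definition may differ.

```python
from statistics import mean


def average_degradation(absent_acc, injected_acc_by_scenario):
    """Mean accuracy drop in percentage points (ASSUMED definition).

    absent_acc: accuracy (%) with no complexity injected.
    injected_acc_by_scenario: dict mapping a scenario name to the
        accuracy (%) measured with that complexity injected.
    """
    if not injected_acc_by_scenario:
        raise ValueError("need at least one complexity scenario")
    # One drop per scenario: baseline accuracy minus injected accuracy.
    drops = [absent_acc - acc for acc in injected_acc_by_scenario.values()]
    return mean(drops)


# Placeholder inputs for illustration only; not leaderboard data.
example = average_degradation(70.0, {"scenario_a": 65.0, "scenario_b": 60.0})
print(example)
```

Under this assumed definition a smaller value means the model loses less accuracy when complexity is injected, which matches the "lower is better" note on the metric.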