AgentBench FC
Function-calling edition of AgentBench, evaluating LLM agents on ALFWorld, database, knowledge graph, operating-system, and WebShop environments using pass@1 success rates.
25 rows
avg_pass_at_1 (primary metric)
Sampled 2026-05-06
Metadata
Metrics
AVG Pass@1, ALFWorld Pass@1, DB Pass@1, KG Pass@1, OS Pass@1, WebShop Pass@1
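The relationship between the per-environment metrics and the aggregate can be sketched as follows. This is a minimal illustration and not the benchmark's own scoring code: it assumes that pass@1 is the percentage of tasks solved on a single attempt and that AVG Pass@1 is the unweighted mean of the five environment scores, neither of which the page states explicitly. The function names and the sample outcomes are hypothetical.

```python
from statistics import mean


def pass_at_1(outcomes):
    """Percentage of tasks solved on the single attempt (assumed definition)."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


def avg_pass_at_1(per_env):
    """Unweighted mean of per-environment pass@1 (assumed aggregation)."""
    return mean(pass_at_1(outcomes) for outcomes in per_env.values())


# Hypothetical outcomes for illustration only; not taken from the leaderboard.
example = {
    "ALFWorld": [True, True, False, True],
    "DB": [True, False],
    "KG": [False, False, True, True],
    "OS": [True, False, False, False],
    "WebShop": [True, True, True, True],
}

for env, outcomes in example.items():
    print(f"{env}: {pass_at_1(outcomes):.2f}")
print(f"AVG: {avg_pass_at_1(example):.2f}")
```

An unweighted mean treats every environment equally regardless of how many tasks it contains; a task-weighted average would give different numbers, so the aggregation rule should be checked against the benchmark's documentation before comparing scores.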
| Rank | Subject | AVG Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.40 | — | Imported | 2026-05-06 |
| 2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.70 | — | Imported | 2026-05-06 |
| 3 | AgentRL w/ GLM-4-9B-0414 | 65.00 | — | Imported | 2026-05-06 |
| 4 | AgentRL w/ Qwen2.5-7B-Instruct | 62.00 | — | Imported | 2026-05-06 |
| 5 | AgentRL w/ Qwen2.5-3B-Instruct | 60.00 | — | Imported | 2026-05-06 |
| 6 | Claude Sonnet 4.5 (2025-09-29) | 58.90 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 7 | Claude Sonnet 4.5 Thinking (2025-09-29) | 58.30 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 8 | Claude Sonnet 4 Thinking (2025-05-14) | 58.20 | — | Imported | 2026-05-06 |
| 9 | Claude Sonnet 4 (2025-05-14) | 57.40 | Claude Sonnet 4 (`anthropic-claude-sonnet-4`) | Imported | 2026-05-06 |
| 10 | Claude Sonnet 3.7 (2025-02-19) | 53.20 | — | Imported | 2026-05-06 |
| 11 | GPT-5 (2025-08-07) | 52.20 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 12 | AgentLM-70B | 51.40 | — | Imported | 2026-05-06 |
| 13 | Claude Sonnet 3.7 Thinking (2025-02-19) | 50.00 | — | Imported | 2026-05-06 |
| 14 | DeepSeek-R1 (2025-05-28) | 49.30 | R1 0528 (`deepseek-deepseek-r1-0528`) | Imported | 2026-05-06 |
| 15 | AgentLM-13B | 45.10 | — | Imported | 2026-05-06 |
| 16 | AgentLM-7B | 42.70 | — | Imported | 2026-05-06 |
| 17 | o3-mini (2025-01-31) | 40.90 | o3-mini (`openai-o3-mini`) | Imported | 2026-05-06 |
| 18 | Qwen2.5-72B-Instruct | 40.80 | Qwen2.5 72B Instruct (`qwen-qwen-2.5-72b-instruct`) | Imported | 2026-05-06 |
| 19 | o4-mini (2025-04-16) | 39.70 | o4 Mini (`openai-o4-mini`) | Imported | 2026-05-06 |
| 20 | GPT-4o (2024-11-20) | 39.60 | GPT-4o (2024-11-20) (`openai-gpt-4o-2024-11-20`) | Imported | 2026-05-06 |
| 21 | Qwen2.5-32B-Instruct | 37.20 | — | Imported | 2026-05-06 |
| 22 | Hephaestus-8B-IFT | 36.30 | — | Imported | 2026-05-06 |
| 23 | DeepSeek-V3 (2025-03-24) | 36.10 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-06 |
| 24 | Hephaestus-8B-Base | 31.90 | — | Imported | 2026-05-06 |
| 25 | Qwen2.5-14B-Instruct | 27.20 | — | Imported | 2026-05-06 |