AgentBench FC

Function-calling edition of AgentBench, evaluating LLM agents on ALFWorld, database, knowledge graph, operating-system, and WebShop environments using pass@1 success rates.

Rows: 25
Primary metric: avg_pass_at_1
Sampled: 2026-05-06

Metadata

Metrics

AVG Pass@1, ALFWorld Pass@1, DB Pass@1, KG Pass@1, OS Pass@1, WebShop Pass@1
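As a rough illustration of how these metrics relate, the sketch below computes a pass@1 percentage from per-task outcomes and then averages the five environment scores. It assumes AVG Pass@1 is the unweighted mean of the five environment scores; the source does not state the aggregation rule, so treat that as an assumption. The function names and environment keys are hypothetical.

```python
from statistics import mean

# Hypothetical keys for the five environments named in the metrics list.
ENVIRONMENTS = ("alfworld", "db", "kg", "os", "webshop")


def pass_at_1(outcomes):
    """Return the percentage of tasks solved on a single attempt.

    `outcomes` is a sequence of booleans, one per task.
    """
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    solved = sum(1 for ok in outcomes if ok)
    return 100.0 * solved / len(outcomes)


def avg_pass_at_1(per_env):
    """Average the per-environment pass@1 scores.

    ASSUMPTION: an unweighted mean over the five environments; the
    benchmark's actual aggregation may differ.
    """
    return mean(per_env[env] for env in ENVIRONMENTS)


# Illustrative numbers only, not taken from the leaderboard.
example_scores = {"alfworld": 50.0, "db": 60.0, "kg": 70.0, "os": 80.0, "webshop": 90.0}
example_avg = avg_pass_at_1(example_scores)
```

If the benchmark weights environments by task count, `avg_pass_at_1` would need a weighted mean instead.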

Latest Results

Rows are imported from the public AgentBench FC Google Sheets CSV. Source model display names, organizations, release dates, and uncertainty columns are preserved.
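The import-and-rank step can be sketched as follows. This is a minimal sketch, not the site's actual pipeline: the column names `model` and `avg_pass_at_1` and the helper `rank_rows` are assumptions, and the sample text stands in for the downloaded CSV. The three sample rows reuse values from the results listed on this page.

```python
import csv
import io


def rank_rows(csv_text, name_col="model", metric_col="avg_pass_at_1"):
    """Parse CSV text and return (rank, name, score) tuples, best score first.

    ASSUMPTION: the column names are hypothetical; adjust them to match
    the real sheet's header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows with a missing or empty primary metric.
    rows = [r for r in reader if (r.get(metric_col) or "").strip()]
    rows.sort(key=lambda r: float(r[metric_col]), reverse=True)
    return [
        (rank, r[name_col], float(r[metric_col]))
        for rank, r in enumerate(rows, start=1)
    ]


# Stand-in for the exported sheet; header names are assumed.
SAMPLE_CSV = """model,avg_pass_at_1
AgentLM-70B,51.40
GPT-5 (2025-08-07),52.20
o3-mini (2025-01-31),40.90
"""

ranked = rank_rows(SAMPLE_CSV)
for rank, name, score in ranked:
    print(f"{rank} {name} {score:.2f}")
```

Sorting on the parsed float rather than the raw string avoids mis-ordering values such as `65` and `58.90`.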

Rank | Subject | AVG Pass@1 | Model Match | Provenance | Sampled
1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.40 | | Imported | 2026-05-06
2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.70 | | Imported | 2026-05-06
3 | AgentRL w/ GLM-4-9B-0414 | 65 | | Imported | 2026-05-06
4 | AgentRL w/ Qwen2.5-7B-Instruct | 62 | | Imported | 2026-05-06
5 | AgentRL w/ Qwen2.5-3B-Instruct | 60 | | Imported | 2026-05-06
6 | Claude Sonnet 4.5 (2025-09-29) | 58.90 | Claude Sonnet 4.5 [anthropic-claude-sonnet-4.5] | Imported | 2026-05-06
7 | Claude Sonnet 4.5 Thinking (2025-09-29) | 58.30 | Claude Sonnet 4.5 [anthropic-claude-sonnet-4.5] | Imported | 2026-05-06
8 | Claude Sonnet 4 Thinking (2025-05-14) | 58.20 | | Imported | 2026-05-06
9 | Claude Sonnet 4 (2025-05-14) | 57.40 | Claude Sonnet 4 [anthropic-claude-sonnet-4] | Imported | 2026-05-06
10 | Claude Sonnet 3.7 (2025-02-19) | 53.20 | | Imported | 2026-05-06
11 | GPT-5 (2025-08-07) | 52.20 | GPT-5 [openai-gpt-5] | Imported | 2026-05-06
12 | AgentLM-70B | 51.40 | | Imported | 2026-05-06
13 | Claude Sonnet 3.7 Thinking (2025-02-19) | 50 | | Imported | 2026-05-06
14 | DeepSeek-R1 (2025-05-28) | 49.30 | R1 0528 [deepseek-deepseek-r1-0528] | Imported | 2026-05-06
15 | AgentLM-13B | 45.10 | | Imported | 2026-05-06
16 | AgentLM-7B | 42.70 | | Imported | 2026-05-06
17 | o3-mini (2025-01-31) | 40.90 | o3-mini [openai-o3-mini] | Imported | 2026-05-06
18 | Qwen2.5-72B-Instruct | 40.80 | Qwen2.5 72B Instruct [qwen-qwen-2.5-72b-instruct] | Imported | 2026-05-06
19 | o4-mini (2025-04-16) | 39.70 | o4 Mini [openai-o4-mini] | Imported | 2026-05-06
20 | GPT-4o (2024-11-20) | 39.60 | GPT-4o (2024-11-20) [openai-gpt-4o-2024-11-20] | Imported | 2026-05-06
21 | Qwen2.5-32B-Instruct | 37.20 | | Imported | 2026-05-06
22 | Hephaestus-8B-IFT | 36.30 | | Imported | 2026-05-06
23 | DeepSeek-V3 (2025-03-24) | 36.10 | DeepSeek V3 [deepseek-deepseek-chat] | Imported | 2026-05-06
24 | Hephaestus-8B-Base | 31.90 | | Imported | 2026-05-06
25 | Qwen2.5-14B-Instruct | 27.20 | | Imported | 2026-05-06