Galileo Agent Leaderboard

Agentic task leaderboard ranking LLMs across banking, healthcare, insurance, investment, and telecom workflows with accuracy, trajectory quality, cost, latency, and turn-count metrics.

Rows: 22
Primary metric: avg_ac
Sampled: 2026-05-06

Metadata

Metrics

Overall: Avg AC, Avg TSQ, Avg Total Cost, Avg Input Cost, Avg Output Cost, Avg Session Duration, Avg Turns
Per domain (Banking, Healthcare, Insurance, Investment, Telecom): AC, TSQ, Cost, Duration, Turns
All Cost, Duration, and Turns metrics are lower is better.
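The "lower is better" annotations matter whenever rows are sorted by something other than the primary metric. The following is a minimal sketch of one way to encode that direction. It assumes rows are plain dictionaries and that metric keys end in "cost", "duration", or "turns"; the key names (`avg_ac`, `avg_total_cost`) and the sample rows are hypothetical placeholders, not values from the leaderboard.

```python
# Assumption: metric keys end with one of these suffixes when lower is better.
# The real column names in the source CSV may differ.
LOWER_IS_BETTER_SUFFIXES = ("cost", "duration", "turns")


def lower_is_better(metric: str) -> bool:
    """Return True for cost, duration, and turn-count metrics."""
    return metric.lower().endswith(LOWER_IS_BETTER_SUFFIXES)


def rank_by(rows, metric):
    """Return rows sorted best-first for the given metric."""
    return sorted(
        rows,
        key=lambda row: row[metric],
        reverse=not lower_is_better(metric),
    )


# Placeholder rows with made-up values, for illustration only.
rows = [
    {"subject": "model-a", "avg_ac": 0.40, "avg_total_cost": 0.02},
    {"subject": "model-b", "avg_ac": 0.60, "avg_total_cost": 0.09},
]

best_by_ac = rank_by(rows, "avg_ac")[0]["subject"]
best_by_cost = rank_by(rows, "avg_total_cost")[0]["subject"]
```

Here `best_by_ac` picks the row with the highest score, while `best_by_cost` picks the cheapest one, so a single helper covers both directions.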

Latest Results

Rows are parsed from Galileo's public Agent Leaderboard results_v2.csv. Source model, vendor, pricing, output type, and release-date strings are preserved.
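As a rough illustration of the import step, the sketch below parses CSV text and ranks rows by the primary metric. The column names `model` and `avg_ac` are assumptions about the CSV layout, and the inline sample stands in for the real file; the three scores are taken from the results listed on this page.

```python
import csv
import io

# Stand-in for results_v2.csv. The header names are assumed, not confirmed.
SAMPLE_CSV = """model,avg_ac
gpt-4.1-mini-2025-04-14,0.56
gpt-4.1-2025-04-14,0.62
mistral-medium-2508,0.61
"""


def load_ranked(csv_text, metric="avg_ac"):
    """Parse CSV text and return rows ranked by `metric`, highest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep the source model string unchanged as the subject.
    rows = [{"subject": r["model"], metric: float(r[metric])} for r in reader]
    # list.sort is stable, so tied rows keep their order from the file.
    rows.sort(key=lambda r: r[metric], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


ranked = load_ranked(SAMPLE_CSV)
for row in ranked:
    print(row["rank"], row["subject"], row["avg_ac"])
```

Because the sort is stable, rows that tie on the metric receive consecutive ranks in file order rather than a shared rank. That is one possible tie policy, not necessarily the one used for this page.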

Rank | Subject | Avg AC | Model Match | Provenance | Sampled
1 | gpt-4.1-2025-04-14 | 0.62 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06
2 | mistral-medium-2508 | 0.61 | (none) | Imported | 2026-05-06
3 | gpt-4.1-mini-2025-04-14 | 0.56 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-06
4 | claude-sonnet-4-20250514 | 0.55 | Claude Sonnet 4 (anthropic-claude-sonnet-4) | Imported | 2026-05-06
5 | kimi-k2-instruct | 0.53 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-06
6 | qwen3-235b-a22b-instruct-2507 | 0.53 | Qwen3 235B A22B Instruct 2507 (qwen-qwen3-235b-a22b-2507) | Imported | 2026-05-06
7 | qwen2.5-72b-instruct | 0.51 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06
8 | gemini-2.5-flash-lite | 0.47 | Gemini 2.5 Flash Lite (google-gemini-2.5-flash-lite) | Imported | 2026-05-06
9 | glm-4.5-air | 0.44 | GLM 4.5 Air (z-ai-glm-4.5-air) | Imported | 2026-05-06
10 | gemini-2.5-pro | 0.43 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06
11 | grok-4-0709 | 0.42 | Grok 4 (x-ai-grok-4) | Imported | 2026-05-06
12 | deepseek-v3 | 0.40 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-06
13 | gemini-2.5-flash | 0.38 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06
14 | gpt-4.1-nano-2025-04-14 | 0.38 | GPT-4.1 Nano (openai-gpt-4.1-nano) | Imported | 2026-05-06
15 | qwen3-235b-a22b-thinking-2507 | 0.34 | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Imported | 2026-05-06
16 | magistral-medium-2506 | 0.32 | (none) | Imported | 2026-05-06
17 | nova-pro-v1 | 0.29 | Nova Pro 1.0 (amazon-nova-pro-v1) | Imported | 2026-05-06
18 | mistral-small-2506 | 0.26 | (none) | Imported | 2026-05-06
19 | llama-3.3-70b-instruct | 0.20 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-06
20 | caller | 0.16 | (none) | Imported | 2026-05-06
21 | nova-lite-v1 | 0.16 | Nova Lite 1.0 (amazon-nova-lite-v1) | Imported | 2026-05-06
22 | magistral-small-2506 | 0.16 | (none) | Imported | 2026-05-06