Galileo Agent Leaderboard
Agentic task leaderboard ranking LLMs across banking, healthcare, insurance, investment, and telecom workflows with accuracy, trajectory quality, cost, latency, and turn-count metrics.
Metadata
Metrics
Avg AC, Avg TSQ, Avg Total Cost (lower is better), Avg Session Duration (lower is better), Avg Turns (lower is better)
Banking AC, Banking TSQ, Banking Cost (lower is better), Banking Duration (lower is better), Banking Turns (lower is better)
Healthcare AC, Healthcare TSQ, Healthcare Cost (lower is better), Healthcare Duration (lower is better), Healthcare Turns (lower is better)
Insurance AC, Insurance TSQ, Insurance Cost (lower is better), Insurance Duration (lower is better), Insurance Turns (lower is better)
Investment AC, Investment TSQ, Investment Cost (lower is better), Investment Duration (lower is better), Investment Turns (lower is better)
Telecom AC, Telecom TSQ, Telecom Cost (lower is better), Telecom Duration (lower is better), Telecom Turns (lower is better)
Avg Input Cost (lower is better), Avg Output Cost (lower is better)
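The metric labels above carry their sort direction in a "(lower is better)" suffix. The following is a minimal sketch of how that suffix could be turned into a direction flag when sorting by a metric. The helper name `parse_metric` is hypothetical, and treating unmarked metrics as higher-is-better is an assumption, not something the leaderboard states.

```python
import re

# Matches the "(lower is better)" suffix used in the metric labels.
_LOWER = re.compile(r"\s*\(lower is better\)\s*$")


def parse_metric(label: str) -> tuple[str, bool]:
    """Return (metric name, higher_is_better) for one metric label.

    Assumption: a label without the suffix is a higher-is-better metric.
    """
    name, n_subs = _LOWER.subn("", label.strip())
    return name, n_subs == 0


# A short excerpt of the metric list, split on commas.
labels = "Avg AC, Avg TSQ, Avg Total Cost (lower is better), Avg Turns (lower is better)"
parsed = [parse_metric(part) for part in labels.split(",")]
```

With a flag like this, a row list could be sorted with `reverse=higher_is_better`, so that the best value comes first for either kind of metric.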
| Rank | Subject | Avg AC | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4.1-2025-04-14 | 0.62 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-06 |
| 2 | mistral-medium-2508 | 0.61 | — | Imported | 2026-05-06 |
| 3 | gpt-4.1-mini-2025-04-14 | 0.56 | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-06 |
| 4 | claude-sonnet-4-20250514 | 0.55 | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-05-06 |
| 5 | kimi-k2-instruct | 0.53 | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-05-06 |
| 6 | qwen3-235b-a22b-instruct-2507 | 0.53 | Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507 | Imported | 2026-05-06 |
| 7 | qwen2.5-72b-instruct | 0.51 | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Imported | 2026-05-06 |
| 8 | gemini-2.5-flash-lite | 0.47 | Gemini 2.5 Flash Lite google-gemini-2.5-flash-lite | Imported | 2026-05-06 |
| 9 | glm-4.5-air | 0.44 | GLM 4.5 Air z-ai-glm-4.5-air | Imported | 2026-05-06 |
| 10 | gemini-2.5-pro | 0.43 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-06 |
| 11 | grok-4-0709 | 0.42 | Grok 4 x-ai-grok-4 | Imported | 2026-05-06 |
| 12 | deepseek-v3 | 0.40 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-06 |
| 13 | gemini-2.5-flash | 0.38 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-06 |
| 14 | gpt-4.1-nano-2025-04-14 | 0.38 | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-06 |
| 15 | qwen3-235b-a22b-thinking-2507 | 0.34 | Qwen3 235B A22B Thinking 2507 qwen-qwen3-235b-a22b-thinking-2507 | Imported | 2026-05-06 |
| 16 | magistral-medium-2506 | 0.32 | — | Imported | 2026-05-06 |
| 17 | nova-pro-v1 | 0.29 | Nova Pro 1.0 amazon-nova-pro-v1 | Imported | 2026-05-06 |
| 18 | mistral-small-2506 | 0.26 | — | Imported | 2026-05-06 |
| 19 | llama-3.3-70b-instruct | 0.20 | Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct | Imported | 2026-05-06 |
| 20 | caller | 0.16 | — | Imported | 2026-05-06 |
| 21 | nova-lite-v1 | 0.16 | Nova Lite 1.0 amazon-nova-lite-v1 | Imported | 2026-05-06 |
| 22 | magistral-small-2506 | 0.16 | — | Imported | 2026-05-06 |
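The table gives tied Avg AC values consecutive ranks (for example, 0.53 at ranks 5 and 6) and does not state a tie-break rule. Some subjects also show a dash placeholder in the Model Match column. The sketch below is one way to parse rows in this pipe-delimited layout and surface both cases. The function name `parse_table` is hypothetical, and reading the dash as "no matched model" is an assumption.

```python
def parse_table(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited table into a list of row dicts keyed by header."""
    lines = [ln.strip() for ln in markdown.strip().splitlines()
             if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Four rows copied from the table above.
sample = """
| Rank | Subject | Avg AC | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4.1-2025-04-14 | 0.62 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-06 |
| 2 | mistral-medium-2508 | 0.61 | — | Imported | 2026-05-06 |
| 5 | kimi-k2-instruct | 0.53 | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-05-06 |
| 6 | qwen3-235b-a22b-instruct-2507 | 0.53 | Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507 | Imported | 2026-05-06 |
"""

rows = parse_table(sample)

# Subjects whose Model Match cell holds only the dash placeholder.
unmatched = [r["Subject"] for r in rows if r["Model Match"] == "—"]

# Group subjects by score and keep only scores shared by more than one subject.
by_score: dict[float, list[str]] = {}
for r in rows:
    by_score.setdefault(float(r["Avg AC"]), []).append(r["Subject"])
tied = {score: subjects for score, subjects in by_score.items() if len(subjects) > 1}
```

Because the published scores are rounded to two decimals, a tie found this way may only reflect rounding rather than identical underlying values.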