Finance Agent v1.1

Evaluating agents on core financial analyst tasks

60 rows
Primary metric: score
Sampled 2026-05-04

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Showing 3 latest source slices.

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 | 64.373% | Claude Opus 4.7 [anthropic-claude-opus-4.7] | Imported | 2026-05-04
2 | Claude Sonnet 4.6 | 63.331% | Claude Sonnet 4.6 [anthropic-claude-sonnet-4.6] | Imported | 2026-05-04
3 | Muse Spark | 60.595% | - | Imported | 2026-05-04
4 | DeepSeek V4 Pro | 60.389% | DeepSeek V4 Pro [deepseek-deepseek-v4-pro] | Imported | 2026-05-04
5 | Claude Opus 4.6 Thinking | 60.046% | Claude Opus 4.6 [anthropic-claude-opus-4.6] | Imported | 2026-05-04
6 | GPT 5.5 | 59.963% | GPT-5.5 [openai-gpt-5.5] | Imported | 2026-05-04
7 | Gemini 3.1 Pro Preview | 59.717% | Gemini 3.1 Pro Preview [google-gemini-3.1-pro-preview] | Imported | 2026-05-04
8 | Claude Opus 4.5 20251101 Thinking | 58.81% | Claude Opus 4.5 [anthropic-claude-opus-4.5] | Imported | 2026-05-04
9 | GPT 5.2 2025-12-11 | 58.535% | GPT-5.2 [openai-gpt-5.2] | Imported | 2026-05-04
10 | GLM 5.1 Thinking | 57.655% | GLM 5.1 [z-ai-glm-5.1] | Imported | 2026-05-04
11 | GPT 5.4 2026-03-05 | 57.152% | GPT-5.4 [openai-gpt-5.4] | Imported | 2026-05-04
12 | Kimi K2.6 Thinking | 57.056% | MoonshotAI: Kimi K2.6 [moonshotai-kimi-k2.6] | Imported | 2026-05-04
13 | GPT 5.1 2025-11-13 | 55.309% | GPT-5.1 [openai-gpt-5.1] | Imported | 2026-05-04
14 | Gemini 3 Pro Preview | 55.154% | Gemini 3 [google-gemini-3] | Imported | 2026-05-04
15 | Qwen 3.6 Plus | 54.627% | Qwen3.6 Plus [qwen-qwen3.6-plus] | Imported | 2026-05-04
16 | Claude Sonnet 4.5 20250929 Thinking | 54.5% | Claude Sonnet 4.5 [anthropic-claude-sonnet-4.5] | Imported | 2026-05-04
17 | Qwen 3.5 Plus Thinking | 54.475% | - | Imported | 2026-05-04
18 | Grok 4.3 | 53.812% | Grok 4.3 [x-ai-grok-4.3] | Imported | 2026-05-04
19 | Grok 4 0709 | 53.506% | Grok 4 [x-ai-grok-4] | Imported | 2026-05-04
20 | GPT 5.4 Mini 2026-03-17 | 53.405% | GPT-5.4 Mini [openai-gpt-5.4-mini] | Imported | 2026-05-04
21 | GLM 5 Thinking | 53.182% | GLM 5 [z-ai-glm-5] | Imported | 2026-05-04
22 | Qwen 3.6 Max Preview | 52.785% | Qwen3.6 Max Preview [qwen-qwen3.6-max-preview] | Imported | 2026-05-04
23 | Grok 4.1 Fast Reasoning | 52.448% | Grok 4.1 Fast [x-ai-grok-4.1-fast] | Imported | 2026-05-04
24 | Grok 4.20 0309 Reasoning | 52.295% | Grok 4.20 [x-ai-grok-4.20] | Imported | 2026-05-04
25 | GPT 5 2025-08-07 | 52.151% | GPT-5 [openai-gpt-5] | Imported | 2026-05-04
26 | GPT 5 Mini 2025-08-07 | 51.928% | GPT-5 Mini [openai-gpt-5-mini] | Imported | 2026-05-04
27 | Gemma 4 31B It | 50.788% | Gemma 4 31B [google-gemma-4-31b-it] | Imported | 2026-05-04
28 | Kimi K2.5 Thinking | 50.622% | MoonshotAI: Kimi K2.5 [moonshotai-kimi-k2.5] | Imported | 2026-05-04
29 | MiniMax M2.7 | 48.402% | MiniMax M2.7 [minimax-minimax-m2.7] | Imported | 2026-05-04
30 | GPT 5.4 Nano 2026-03-17 | 47.801% | GPT-5.4 Nano [openai-gpt-5.4-nano] | Imported | 2026-05-04
31 | Gemini 3 Flash Preview | 47.598% | Gemini 3 Flash Preview [google-gemini-3-flash-preview] | Imported | 2026-05-04
32 | Claude Haiku 4.5 20251001 Thinking | 46.931% | Claude Haiku 4.5 [anthropic-claude-haiku-4.5] | Imported | 2026-05-04
33 | Gemini 3.1 Flash Lite Preview | 46.123% | Gemini 3.1 Flash Lite Preview [google-gemini-3.1-flash-lite-preview] | Imported | 2026-05-04
34 | Mistral Medium 3.5 | 46.113% | Mistral: Mistral Medium 3.5 [mistralai-mistral-medium-3-5] | Imported | 2026-05-04
35 | Grok 4 Fast Reasoning | 46.084% | Grok 4 Fast [x-ai-grok-4-fast] | Imported | 2026-05-04
36 | GLM 4.7 | 45.977% | GLM 4.7 [z-ai-glm-4.7] | Imported | 2026-05-04
37 | Qwen 3.5 Flash | 45.639% | Qwen3.5-Flash [qwen-qwen3.5-flash-02-23] | Imported | 2026-05-04
38 | Grok 4.1 Fast Non Reasoning | 44.362% | Grok 4.1 Fast [x-ai-grok-4.1-fast] | Imported | 2026-05-04
39 | Qwen 3 Max | 44.295% | Qwen3 Max [qwen-qwen3-max] | Imported | 2026-05-04
40 | Gemini 2.5 Pro | 41.589% | Gemini 2.5 Pro [google-gemini-2.5-pro] | Imported | 2026-05-04
41 | MiniMax M2.5 Lightning | 38.579% | - | Imported | 2026-05-04
42 | Kimi K2 Thinking | 36.647% | MoonshotAI: Kimi K2 Thinking [moonshotai-kimi-k2-thinking] | Imported | 2026-05-04
43 | GLM 4.6 | 36.48% | GLM 4.6 [z-ai-glm-4.6] | Imported | 2026-05-04
44 | MiniMax M2.1 | 33.35% | MiniMax M2.1 [minimax-minimax-m2.1] | Imported | 2026-05-04
45 | GPT Oss 120B | 21.541% | gpt-oss-120b [openai-gpt-oss-120b] | Imported | 2026-05-04
46 | Mistral Large 2512 | 18.049% | Mistral: Mistral Large 3 2512 [mistralai-mistral-large-2512] | Imported | 2026-05-04
47 | GPT 4O 2024-08-06 | 8.064% | GPT-4o (2024-08-06) [openai-gpt-4o-2024-08-06] | Imported | 2026-05-04
48 | Command A 03 2025 | 4.226% | Command A [cohere-command-a] | Imported | 2026-05-04
49 | DeepSeek V3P2 Thinking | 2.345% | - | Imported | 2026-05-04
50 | Jamba Large 1.7 | 0.37% | AI21 Jamba Large 1.7 [ai21-jamba-large-1.7] | Imported | 2026-05-04
51 | DeepSeek V3P2 | 0% | - | Imported | 2026-05-04

1 | Claude Opus 4.7 | 64.4% | Claude Opus 4.7 [anthropic-claude-opus-4.7] | Launch post | 2026-04-23
2 | GPT-5.4 Pro | 61.5% | GPT-5.4 Pro [openai-gpt-5.4-pro] | Launch post | 2026-04-23
3 | GPT-5.5 | 60% | GPT-5.5 [openai-gpt-5.5] | Launch post | 2026-04-23
4 | Gemini 3.1 Pro Preview | 59.7% | Gemini 3.1 Pro Preview [google-gemini-3.1-pro-preview] | Launch post | 2026-04-23
5 | GPT-5.4 | 56% | GPT-5.4 [openai-gpt-5.4] | Launch post | 2026-04-23

1 | Claude Opus 4.7 | 64.4% | Claude Opus 4.7 [anthropic-claude-opus-4.7] | Launch post | 2026-04-16
2 | GPT-5.4 Pro | 61.5% | GPT-5.4 Pro [openai-gpt-5.4-pro] | Launch post | 2026-04-16
3 | Claude Opus 4.6 | 60.1% | Claude Opus 4.6 [anthropic-claude-opus-4.6] | Launch post | 2026-04-16
4 | Gemini 3.1 Pro Preview | 59.7% | Gemini 3.1 Pro Preview [google-gemini-3.1-pro-preview] | Launch post | 2026-04-16