Finance Agent v2

Evaluating agents on core financial analyst tasks using the FAB v2 harness

24 rows
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score (higher is better), Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Showing the 2 latest source slices.

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. The primary score is the overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Gemini 3.5 Flash | 57.861% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
2 | Claude Opus 4.8 | 53.918% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
3 | GPT 5.5 | 51.76% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
4 | Claude Opus 4.7 | 51.509% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
5 | Claude Sonnet 4.6 | 51.035% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
6 | Qwen 3.7 Max | 48.353% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28
7 | GPT 5.4 Mini 2026-03-17 | 45.36% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
8 | Kimi K2.6 Thinking | 44.866% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
9 | GLM 5.1 Thinking | 44.792% | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28
10 | DeepSeek V4 Pro | 44.083% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28
11 | Gemini 3.1 Pro Preview | 42.982% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
12 | Gemini 3 Flash Preview | 42.551% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
13 | Qwen 3.6 Plus | 40.846% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28
14 | GPT 5.4 Nano 2026-03-17 | 38.217% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
15 | Grok 4.3 | 37.708% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
16 | Mistral Medium 3.5 | 32.063% | Mistral: Mistral Medium 3.5 (mistralai-mistral-medium-3-5) | Imported | 2026-05-28
17 | Claude Haiku 4.5 20251001 Thinking | 31.01% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28
18 | Gemini 3.1 Flash Lite Preview | 29.988% | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-28
19 | Grok 4.20 0309 Reasoning | 28.492% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
20 | MiniMax M2.7 | 27.887% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.8 | 53.9% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28
2 | GPT-5.5 | 51.8% | GPT-5.5 (openai-gpt-5.5) | Self-reported | 2026-05-28
3 | Claude Opus 4.7 | 51.5% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
4 | Gemini 3.1 Pro Preview | 43% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Self-reported | 2026-05-28
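
The four self-reported scores look like the imported scores for the same models rounded to one decimal place. The following is a minimal sketch of that consistency check, assuming one-decimal rounding is the only difference between the two slices. The function name `mismatches` and the dictionary layout are illustrative and are not part of the leaderboard or the FAB v2 harness.

```python
# Scores (percent accuracy) copied from the imported and self-reported slices.
imported = {
    "anthropic-claude-opus-4.8": 53.918,
    "openai-gpt-5.5": 51.76,
    "anthropic-claude-opus-4.7": 51.509,
    "google-gemini-3.1-pro-preview": 42.982,
}
self_reported = {
    "anthropic-claude-opus-4.8": 53.9,
    "openai-gpt-5.5": 51.8,
    "anthropic-claude-opus-4.7": 51.5,
    "google-gemini-3.1-pro-preview": 43.0,
}


def mismatches(imported, self_reported, ndigits=1, tol=1e-9):
    """Return model ids whose self-reported score differs from the
    imported score after rounding the latter to `ndigits` decimals.

    Assumption: rounding is the only expected difference between slices.
    """
    return [
        model_id
        for model_id, reported in self_reported.items()
        if abs(round(imported[model_id], ndigits) - reported) > tol
    ]


# An empty list means every self-reported score matches its imported value.
print(mismatches(imported, self_reported))  # prints []
```

An empty result suggests the two slices describe the same underlying runs at different precision; a non-empty result would list the model ids worth checking by hand.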