Finance Agent v2
Evaluating agents on core financial analyst tasks using the FAB v2 harness
24 rows
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics
Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
Showing 2 latest source slices.
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 57.861% | Gemini 3.5 Flash google-gemini-3.5-flash | Imported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 53.918% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 3 | GPT 5.5 | 51.76% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 51.509% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-28 |
| 5 | Claude Sonnet 4.6 | 51.035% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-28 |
| 6 | Qwen 3.7 Max | 48.353% | Qwen3.7 Max qwen-qwen3.7-max | Imported | 2026-05-28 |
| 7 | GPT 5.4 Mini 2026-03-17 | 45.36% | GPT-5.4 Mini openai-gpt-5.4-mini | Imported | 2026-05-28 |
| 8 | Kimi K2.6 Thinking | 44.866% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-28 |
| 9 | GLM 5.1 Thinking | 44.792% | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-28 |
| 10 | DeepSeek V4 Pro | 44.083% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Imported | 2026-05-28 |
| 11 | Gemini 3.1 Pro Preview | 42.982% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 12 | Gemini 3 Flash Preview | 42.551% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-28 |
| 13 | Qwen 3.6 Plus | 40.846% | Qwen3.6 Plus qwen-qwen3.6-plus | Imported | 2026-05-28 |
| 14 | GPT 5.4 Nano 2026-03-17 | 38.217% | GPT-5.4 Nano openai-gpt-5.4-nano | Imported | 2026-05-28 |
| 15 | Grok 4.3 | 37.708% | Grok 4.3 x-ai-grok-4.3 | Imported | 2026-05-28 |
| 16 | Mistral Medium 3.5 | 32.063% | Mistral: Mistral Medium 3.5 mistralai-mistral-medium-3-5 | Imported | 2026-05-28 |
| 17 | Claude Haiku 4.5 20251001 Thinking | 31.01% | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-28 |
| 18 | Gemini 3.1 Flash Lite Preview | 29.988% | Gemini 3.1 Flash Lite Preview google-gemini-3.1-flash-lite-preview | Imported | 2026-05-28 |
| 19 | Grok 4.20 0309 Reasoning | 28.492% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-28 |
| 20 | MiniMax M2.7 | 27.887% | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-28 |
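
Self-reported slice:

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 53.9% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |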
| 2 | GPT-5.5 | 51.8% | GPT-5.5 openai-gpt-5.5 | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 51.5% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 43% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Self-reported | 2026-05-28 |
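
The four self-reported scores look like rounded versions of the corresponding imported scores. A minimal sketch of that check follows, with the values transcribed by hand from the rows above. The helper name `max_abs_gap` and the 0.05-point tolerance are illustrative assumptions, not part of the FAB v2 harness.

```python
# Scores in percent, copied by hand from the table rows above and keyed by model slug.
imported = {
    "anthropic-claude-opus-4.8": 53.918,
    "openai-gpt-5.5": 51.76,
    "anthropic-claude-opus-4.7": 51.509,
    "google-gemini-3.1-pro-preview": 42.982,
}
self_reported = {
    "anthropic-claude-opus-4.8": 53.9,
    "openai-gpt-5.5": 51.8,
    "anthropic-claude-opus-4.7": 51.5,
    "google-gemini-3.1-pro-preview": 43.0,
}


def max_abs_gap(a, b):
    """Largest absolute difference (percentage points) over slugs present in both dicts."""
    shared = a.keys() & b.keys()
    return max(abs(a[k] - b[k]) for k in shared)


gap = max_abs_gap(imported, self_reported)

# Assumed tolerance: half a unit in the first decimal place, plus float slack.
consistent = gap <= 0.05 + 1e-9

print(f"largest gap: {gap:.3f} points, consistent with rounding: {consistent}")
```

Under these assumptions the largest gap is the GPT-5.5 row (51.76% imported against 51.8% self-reported), which is what rounding to one decimal place would produce.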