MortgageTax

Evaluates how well models read and understand tax certificates presented as images.

Rows: 76
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 | 70.27% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
2 | Claude Opus 4.8 | 69.912% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
3 | Gemini 3.1 Pro Preview | 69.396% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
4 | Gemini 3 Pro Preview | 69.078% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28
5 | Gemini 2.5 Pro | 68.918% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28
6 | GPT 5.5 | 68.76% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
7 | Gemini 3 Flash Preview | 68.72% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
8 | Claude 3 7 Sonnet 20250219 | 68.68% | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-28
9 | Claude Opus 4.5 20251101 | 68.68% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
10 | Claude Opus 4.6 Thinking | 68.522% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28
11 | GPT 5.4 2026-03-05 | 68.323% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
12 | Qwen 3.6 27B | 68.283% | Qwen3.6 27B (qwen-qwen3.6-27b) | Imported | 2026-05-28
13 | Gemini 3.5 Flash | 68.124% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
14 | Gemini 3.1 Flash Lite Preview | 68.044% | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-28
15 | Qwen 3.6 Plus | 67.965% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28
16 | Claude Sonnet 4.6 | 67.726% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
17 | Claude Opus 4.5 20251101 Thinking | 67.686% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
18 | Qwen 3.5 Flash | 67.368% | Qwen3.5-Flash (qwen-qwen3.5-flash-02-23) | Imported | 2026-05-28
19 | Gemini 2.5 Pro Exp 03 25 | 67.17% | | Imported | 2026-05-28
20 | GPT 5.2 2025-12-11 | 67.13% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28
21 | GPT 5 Mini 2025-08-07 | 66.892% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28
22 | Claude 3 7 Sonnet 20250219 Thinking | 66.852% | | Imported | 2026-05-28
23 | Kimi K2.5 Thinking | 66.534% | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28
24 | GPT 4.1 2025-04-14 | 65.938% | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-28
25 | Kimi K2.6 Thinking | 65.818% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
26 | O3 2025-04-16 | 65.7% | o3 (openai-o3) | Imported | 2026-05-28
27 | GPT 4.1 Mini 2025-04-14 | 65.501% | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-28
28 | GPT 5 2025-08-07 | 65.454% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28
29 | O4 Mini 2025-04-16 | 64.826% | o4 Mini (openai-o4-mini) | Imported | 2026-05-28
30 | Claude 3 5 Sonnet 20241022 | 64.07% | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-28
31 | Claude Sonnet 4.5 20250929 Thinking | 63.99% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
32 | GPT 5.4 Mini 2026-03-17 | 63.514% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
33 | Gemini 2.5 Flash Preview 09 2025 | 62.599% | | Imported | 2026-05-28
34 | Claude Sonnet 4 20250514 | 62.468% | Claude Sonnet 4 (anthropic-claude-sonnet-4) | Imported | 2026-05-28
35 | Claude Haiku 4.5 20251001 Thinking | 62.162% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28
36 | Magistral Small 2509 | 62.122% | | Imported | 2026-05-28
37 | Gemini 2.5 Flash Preview 09 2025 Thinking | 61.924% | | Imported | 2026-05-28
38 | Gemma 4 31B It | 61.368% | Gemma 4 31B (google-gemma-4-31b-it) | Imported | 2026-05-28
39 | GPT 5.1 2025-11-13 | 61.368% | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28
40 | Mistral Small 2503 | 61.208% | | Imported | 2026-05-28
41 | Claude Opus 4.1 20250805 | 61.089% | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-28
42 | GPT 4O 2024-08-06 | 60.97% | GPT-4o (2024-08-06) (openai-gpt-4o-2024-08-06) | Imported | 2026-05-28
43 | Qwen 3.5 Plus Thinking | 60.772% | | Imported | 2026-05-28
44 | Gemini 2.0 Flash Exp | 60.374% | | Imported | 2026-05-28
45 | Gemini 2.0 Flash 001 | 59.658% | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-28
46 | GPT 5.4 Nano 2026-03-17 | 59.102% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
47 | Claude Opus 4 20250514 | 58.586% | Claude Opus 4 (anthropic-claude-opus-4) | Imported | 2026-05-28
48 | Llama4 Maverick Instruct Basic | 58.506% | | Imported | 2026-05-28
49 | Gemini 2.5 Flash Preview 04 17 | 58.426% | | Imported | 2026-05-28
50 | Llama 4 Scout 17B 16E Instruct | 57.75% | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-28
51 | Gemini 2.5 Flash Lite Preview 09 2025 Thinking | 57.552% | | Imported | 2026-05-28
52 | Gemini 2.5 Flash Preview 04 17 Thinking | 57.512% | | Imported | 2026-05-28
53 | GPT 4O 2024-11-20 | 57.432% | GPT-4o (2024-11-20) (openai-gpt-4o-2024-11-20) | Imported | 2026-05-28
54 | Gemini 1.5 Pro 002 | 56.756% | | Imported | 2026-05-28
55 | Claude Opus 4.1 20250805 Thinking | 56.121% | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-28
56 | Gemini 2.5 Flash Lite Preview 09 2025 | 55.803% | Gemini 2.5 Flash Lite Preview 09-2025 (google-gemini-2.5-flash-lite-preview-09-2025) | Imported | 2026-05-28
57 | GPT 4O Mini 2024-07-18 | 54.492% | GPT-4o-mini (2024-07-18) (openai-gpt-4o-mini-2024-07-18) | Imported | 2026-05-28
58 | GPT 5 Nano 2025-08-07 | 53.617% | GPT-5 Nano (openai-gpt-5-nano) | Imported | 2026-05-28
59 | Command A Plus 05 2026 | 53.02% | | Imported | 2026-05-28
60 | GPT 4.1 Nano 2025-04-14 | 52.822% | GPT-4.1 Nano (openai-gpt-4.1-nano) | Imported | 2026-05-28
61 | Mistral Large 2512 | 52.106% | Mistral: Mistral Large 3 2512 (mistralai-mistral-large-2512) | Imported | 2026-05-28
62 | Magistral Medium 2509 | 51.868% | | Imported | 2026-05-28
63 | Claude Sonnet 4 20250514 Thinking | 49.88% | | Imported | 2026-05-28
64 | Grok 4.3 | 48.252% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
65 | Grok 4.20 0309 Reasoning | 45.35% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
66 | Grok 4 0709 | 44.475% | Grok 4 (x-ai-grok-4) | Imported | 2026-05-28
67 | Gemini 1.5 Flash 002 | 42.766% | | Imported | 2026-05-28
68 | Grok 4.1 Fast Reasoning | 42.607% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
69 | Grok 4 Fast Reasoning | 42.09% | Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
70 | Llama 3.2 90B Vision Instruct Turbo | 38.792% | | Imported | 2026-05-28
71 | Mistral Medium 2505 | 36.446% | | Imported | 2026-05-28
72 | Grok 4.1 Fast Non Reasoning | 34.26% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
73 | Mistral Medium 3.5 | 28.895% | Mistral: Mistral Medium 3.5 (mistralai-mistral-medium-3-5) | Imported | 2026-05-28
74 | Grok 4 Fast Non Reasoning | 25.318% | Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
75 | Llama 3.2 11B Vision Instruct Turbo | 23.291% | | Imported | 2026-05-28
76 | Grok 2 Vision 1212 | 3.458% | | Imported | 2026-05-28