TaxEval v2

A Vals-created set of tax questions and responses.

Rows: 114
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Muse Spark | 77.678% | - | Imported | 2026-05-28
2 | Claude Sonnet 4.6 | 77.106% | Claude Sonnet 4.6 [anthropic-claude-sonnet-4.6] | Imported | 2026-05-28
3 | Claude Opus 4.6 Thinking | 75.961% | Claude Opus 4.6 [anthropic-claude-opus-4.6] | Imported | 2026-05-28
4 | Grok 3 | 75.879% | Grok 3 [xaigrok-3] | Imported | 2026-05-28
5 | GPT 5.2 2025-12-11 | 75.756% | GPT-5.2 [openai-gpt-5.2] | Imported | 2026-05-28
6 | Grok 4 Fast Reasoning | 75.697% | Grok 4 Fast [x-ai-grok-4-fast] | Imported | 2026-05-28
7 | Claude Opus 4.8 | 75.634% | Claude Opus 4.8 [anthropic-claude-opus-4.8] | Imported | 2026-05-28
8 | Qwen 3.7 Max | 75.306% | Qwen3.7 Max [qwen-qwen3.7-max] | Imported | 2026-05-28
9 | Claude Opus 4.7 | 75.266% | Claude Opus 4.7 [anthropic-claude-opus-4.7] | Imported | 2026-05-28
10 | GPT 5 Mini 2025-08-07 | 75.225% | GPT-5 Mini [openai-gpt-5-mini] | Imported | 2026-05-28
11 | GPT 4.1 2025-04-14 | 75.061% | GPT-4.1 [openai-gpt-4.1] | Imported | 2026-05-28
12 | GPT 5.5 | 74.98% | GPT-5.5 [openai-gpt-5.5] | Imported | 2026-05-28
13 | GPT 5.1 2025-11-13 | 74.857% | GPT-5.1 [openai-gpt-5.1] | Imported | 2026-05-28
14 | Claude Opus 4.5 20251101 Thinking | 74.856% | Claude Opus 4.5 [anthropic-claude-opus-4.5] | Imported | 2026-05-28
15 | O4 Mini 2025-04-16 | 74.776% | o4 Mini [openai-o4-mini] | Imported | 2026-05-28
16 | Qwen 3.6 Plus | 74.734% | Qwen3.6 Plus [qwen-qwen3.6-plus] | Imported | 2026-05-28
17 | Kimi K2.6 Thinking | 74.652% | MoonshotAI: Kimi K2.6 [moonshotai-kimi-k2.6] | Imported | 2026-05-28
18 | O3 2025-04-16 | 74.571% | o3 [openai-o3] | Imported | 2026-05-28
19 | GPT 4O 2024-11-20 | 74.53% | GPT-4o (2024-11-20) [openai-gpt-4o-2024-11-20] | Imported | 2026-05-28
20 | Gemini 3.5 Flash | 74.366% | Gemini 3.5 Flash [google-gemini-3.5-flash] | Imported | 2026-05-28
21 | Claude Opus 4.5 20251101 | 74.325% | Claude Opus 4.5 [anthropic-claude-opus-4.5] | Imported | 2026-05-28
22 | O1 2024-12-17 | 74.284% | o1 [openai-o1] | Imported | 2026-05-28
23 | Kimi K2.5 Thinking | 74.202% | MoonshotAI: Kimi K2.5 [moonshotai-kimi-k2.5] | Imported | 2026-05-28
24 | Grok 4.20 0309 Reasoning | 74.121% | Grok 4.20 [x-ai-grok-4.20] | Imported | 2026-05-28
25 | Claude 3 7 Sonnet 20250219 Thinking | 74.039% | - | Imported | 2026-05-28
26 | Qwen 3 Max Preview | 73.958% | - | Imported | 2026-05-28
27 | GPT 5.4 2026-03-05 | 73.958% | GPT-5.4 [openai-gpt-5.4] | Imported | 2026-05-28
28 | Gemini 3 Flash Preview | 73.876% | Gemini 3 Flash Preview [google-gemini-3-flash-preview] | Imported | 2026-05-28
29 | Claude Opus 4.1 20250805 Thinking | 73.672% | Claude Opus 4.1 [anthropic-claude-opus-4.1] | Imported | 2026-05-28
30 | Qwen 3 Max | 73.508% | Qwen3 Max [qwen-qwen3-max] | Imported | 2026-05-28
31 | GPT 5 2025-08-07 | 73.385% | GPT-5 [openai-gpt-5] | Imported | 2026-05-28
32 | Claude Sonnet 4.5 20250929 Thinking | 73.303% | Claude Sonnet 4.5 [anthropic-claude-sonnet-4.5] | Imported | 2026-05-28
33 | Grok 4.1 Fast Reasoning | 73.14% | Grok 4.1 Fast [x-ai-grok-4.1-fast] | Imported | 2026-05-28
34 | Mistral Large 2512 | 73.058% | Mistral: Mistral Large 3 2512 [mistralai-mistral-large-2512] | Imported | 2026-05-28
35 | Grok 3 Mini Fast Low Reasoning | 72.976% | - | Imported | 2026-05-28
36 | Gemini 2.5 Pro Exp 03 25 | 72.894% | - | Imported | 2026-05-28
37 | Gemini 3.1 Pro Preview | 72.882% | Gemini 3.1 Pro Preview [google-gemini-3.1-pro-preview] | Imported | 2026-05-28
38 | Gemini 2.5 Flash Preview 09 2025 | 72.731% | - | Imported | 2026-05-28
39 | Gemini 3 Pro Preview | 72.568% | Gemini 3 [google-gemini-3] | Imported | 2026-05-28
40 | Claude 3 7 Sonnet 20250219 | 72.404% | Claude 3.7 Sonnet [anthropic-claude-3.7-sonnet] | Imported | 2026-05-28
41 | Gemini 2.5 Flash Preview 09 2025 Thinking | 72.404% | - | Imported | 2026-05-28
42 | GLM 4.5 | 72.404% | GLM 4.5 [z-ai-glm-4.5] | Imported | 2026-05-28
43 | DeepSeek R1 | 72.281% | R1 [deepseek-r1] | Imported | 2026-05-28
44 | Qwen 3.5 Flash | 72.158% | Qwen3.5-Flash [qwen-qwen3.5-flash-02-23] | Imported | 2026-05-28
45 | DeepSeek V4 Pro | 72.077% | DeepSeek V4 Pro [deepseek-deepseek-v4-pro] | Imported | 2026-05-28
46 | Claude Sonnet 4 20250514 Thinking | 71.995% | - | Imported | 2026-05-28
47 | Claude Opus 4 20250514 | 71.914% | Claude Opus 4 [anthropic-claude-opus-4] | Imported | 2026-05-28
48 | GPT 4.1 Mini 2025-04-14 | 71.914% | GPT-4.1 Mini [openai-gpt-4.1-mini] | Imported | 2026-05-28
49 | Gemini 3.1 Flash Lite Preview | 71.79% | Gemini 3.1 Flash Lite Preview [google-gemini-3.1-flash-lite-preview] | Imported | 2026-05-28
50 | Kimi K2 Thinking | 71.709% | MoonshotAI: Kimi K2 Thinking [moonshotai-kimi-k2-thinking] | Imported | 2026-05-28
51 | GPT Oss 120B | 71.586% | gpt-oss-120b [openai-gpt-oss-120b] | Imported | 2026-05-28
52 | Grok 4 Fast Non Reasoning | 71.581% | Grok 4 Fast [x-ai-grok-4-fast] | Imported | 2026-05-28
53 | Claude Opus 4.1 20250805 | 71.464% | Claude Opus 4.1 [anthropic-claude-opus-4.1] | Imported | 2026-05-28
54 | Qwen 3.6 27B | 71.26% | Qwen3.6 27B [qwen-qwen3.6-27b] | Imported | 2026-05-28
55 | GPT 5.4 Mini 2026-03-17 | 71.218% | GPT-5.4 Mini [openai-gpt-5.4-mini] | Imported | 2026-05-28
56 | GLM 5.1 Thinking | 71.194% | GLM 5.1 [z-ai-glm-5.1] | Imported | 2026-05-28
57 | Gemini 2.5 Flash Preview 04 17 | 71.178% | - | Imported | 2026-05-28
58 | Grok 3 Mini Fast High Reasoning | 71.136% | - | Imported | 2026-05-28
59 | GPT 4O 2024-08-06 | 71.136% | GPT-4o (2024-08-06) [openai-gpt-4o-2024-08-06] | Imported | 2026-05-28
60 | DeepSeek V3 0324 | 71.096% | DeepSeek V3 0324 [deepseek-deepseek-chat-v3-0324] | Imported | 2026-05-28
61 | Grok 4.3 | 70.81% | Grok 4.3 [x-ai-grok-4.3] | Imported | 2026-05-28
62 | Qwen 3 235B A22b | 70.646% | Qwen3 235B A22B [qwen-qwen3-235b-a22b] | Imported | 2026-05-28
63 | Gemini 2.5 Flash Preview 04 17 Thinking | 70.524% | - | Imported | 2026-05-28
64 | Mistral Medium 2505 | 70.319% | - | Imported | 2026-05-28
65 | Kimi K2 Instruct | 70.196% | MoonshotAI: Kimi K2 0711 [moonshotai-kimi-k2] | Imported | 2026-05-28
66 | Claude 3 5 Sonnet 20241022 | 70.156% | Claude 3.5 Sonnet [anthropic-claude-3.5-sonnet] | Imported | 2026-05-28
67 | GLM 5 Thinking | 70.033% | GLM 5 [z-ai-glm-5] | Imported | 2026-05-28
68 | Gemini 2.0 Flash Thinking Exp 01 21 | 69.788% | - | Imported | 2026-05-28
69 | Claude Sonnet 4 20250514 | 69.624% | Claude Sonnet 4 [anthropic-claude-sonnet-4] | Imported | 2026-05-28
70 | O3 Mini 2025-01-31 | 69.42% | o3-mini [openai-o3-mini] | Imported | 2026-05-28
71 | GLM 4.7 | 68.766% | GLM 4.7 [z-ai-glm-4.7] | Imported | 2026-05-28
72 | DeepSeek V3P2 Thinking | 68.152% | - | Imported | 2026-05-28
73 | Gemini 2.0 Pro Exp 02 05 | 68.152% | - | Imported | 2026-05-28
74 | MiniMax M2.5 Lightning | 68.152% | - | Imported | 2026-05-28
75 | Mistral Medium 3.5 | 67.988% | Mistral: Mistral Medium 3.5 [mistralai-mistral-medium-3-5] | Imported | 2026-05-28
76 | DeepSeek V3 | 67.907% | DeepSeek V3 [deepseek-deepseek-chat] | Imported | 2026-05-28
77 | Gemini 2.0 Flash Exp | 67.744% | - | Imported | 2026-05-28
78 | Claude Haiku 4.5 20251001 Thinking | 67.539% | Claude Haiku 4.5 [anthropic-claude-haiku-4.5] | Imported | 2026-05-28
79 | GPT 5.4 Nano 2026-03-17 | 67.416% | GPT-5.4 Nano [openai-gpt-5.4-nano] | Imported | 2026-05-28
80 | GPT 5 Nano 2025-08-07 | 67.376% | GPT-5 Nano [openai-gpt-5-nano] | Imported | 2026-05-28
81 | Grok 2 1212 | 67.048% | - | Imported | 2026-05-28
82 | Llama4 Maverick Instruct Basic | 66.558% | - | Imported | 2026-05-28
83 | MiniMax M2.7 | 66.558% | MiniMax M2.7 [minimax-minimax-m2.7] | Imported | 2026-05-28
84 | MiniMax M2.1 | 66.353% | MiniMax M2.1 [minimax-minimax-m2.1] | Imported | 2026-05-28
85 | GLM 4.6 | 66.235% | GLM 4.6 [z-ai-glm-4.6] | Imported | 2026-05-28
86 | Gemini 2.5 Flash Lite Preview 09 2025 | 66.231% | Gemini 2.5 Flash Lite Preview 09-2025 [google-gemini-2.5-flash-lite-preview-09-2025] | Imported | 2026-05-28
87 | Gemini 2.0 Flash 001 | 65.25% | Gemini 2.0 Flash [google-gemini-2.0-flash] | Imported | 2026-05-28
88 | Grok 4 0709 | 65.086% | Grok 4 [x-ai-grok-4] | Imported | 2026-05-28
89 | Gemini 2.5 Flash Lite Preview 09 2025 Thinking | 64.718% | - | Imported | 2026-05-28
90 | Mistral Large 2411 | 63.778% | Mistral Large 2411 [mistralai-mistral-large-2411] | Imported | 2026-05-28
91 | Llama 3.3 Nemotron Super 49B V1 42e84561 Thinking | 63.736% | - | Imported | 2026-05-28
92 | GPT Oss 20B | 63.696% | gpt-oss-20b [openai-gpt-oss-20b] | Imported | 2026-05-28
93 | Magistral Medium 2509 | 61.938% | - | Imported | 2026-05-28
94 | Grok 4.1 Fast Non Reasoning | 61.856% | Grok 4.1 Fast [x-ai-grok-4.1-fast] | Imported | 2026-05-28
95 | Command A 03 2025 | 61.366% | Command A [cohere-command-a] | Imported | 2026-05-28
96 | Jamba Large 1.6 | 60.875% | - | Imported | 2026-05-28
97 | Meta Llama 3.1 405B Instruct Turbo | 60.875% | - | Imported | 2026-05-28
98 | GPT 4.1 Nano 2025-04-14 | 60.752% | GPT-4.1 Nano [openai-gpt-4.1-nano] | Imported | 2026-05-28
99 | GPT 4O Mini 2024-07-18 | 60.548% | GPT-4o-mini (2024-07-18) [openai-gpt-4o-mini-2024-07-18] | Imported | 2026-05-28
100 | Magistral Small 2509 | 60.302% | - | Imported | 2026-05-28
101 | Llama 3.3 Nemotron Super 49B V1 42e84561 | 60.22% | - | Imported | 2026-05-28
102 | Gemini 1.5 Pro 002 | 59.485% | - | Imported | 2026-05-28
103 | Llama 3.3 70B Instruct Turbo | 59.444% | - | Imported | 2026-05-28
104 | Mistral Small 2503 | 58.3% | - | Imported | 2026-05-28
105 | Jamba 1.5 Large | 58.176% | - | Imported | 2026-05-28
106 | Claude 3 5 Haiku 20241022 | 57.359% | - | Imported | 2026-05-28
107 | Meta Llama 3.1 70B Instruct Turbo | 56.174% | - | Imported | 2026-05-28
108 | Llama 4 Scout 17B 16E Instruct | 55.192% | Llama 4 Scout [meta-llama-llama-4-scout] | Imported | 2026-05-28
109 | Command A Plus 05 2026 | 52.82% | - | Imported | 2026-05-28
110 | Mistral Small 2402 | 49.142% | - | Imported | 2026-05-28
111 | Gemini 1.5 Flash 002 | 48.202% | - | Imported | 2026-05-28
112 | Jamba Mini 1.6 | 44.604% | - | Imported | 2026-05-28
113 | Jamba 1.5 Mini | 41.864% | - | Imported | 2026-05-28
114 | Meta Llama 3.1 8B Instruct Turbo | 32.338% | - | Imported | 2026-05-28