MedQA

Evaluating language model bias in medical questions.

95 rows
Primary metric: score
Sampled: 2026-04-16

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
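
The Score and Std. error metrics can be illustrated with a minimal sketch. It assumes the score is plain accuracy over multiple-choice questions and that the standard error is the binomial standard error of that proportion. The page does not state how Vals.ai computes either value, so the formula and the function name `accuracy_with_std_error` are illustrative assumptions.

```python
import math


def accuracy_with_std_error(num_correct: int, num_questions: int) -> tuple[float, float]:
    """Return (accuracy in percent, standard error in percent)."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    p = num_correct / num_questions
    # Assumption: binomial standard error of a proportion, sqrt(p * (1 - p) / n).
    # The benchmark page does not confirm this is the formula it uses.
    std_error = math.sqrt(p * (1 - p) / num_questions)
    return 100 * p, 100 * std_error


# Hypothetical run: 90 correct answers out of 100 questions.
score, std_err = accuracy_with_std_error(num_correct=90, num_questions=100)
print(f"{score:.3f}% +/- {std_err:.3f}")
```

Under this assumption, a smaller standard error means the reported score is a tighter estimate, which is why the metric is marked "lower is better".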

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank Subject Score Model Match Provenance Sampled
1 O1 2024-12-17 96.517% o1
openai-o1
Imported 2026-04-16
2 GPT 5.1 2025-11-13 96.383% GPT-5.1
openai-gpt-5.1
Imported 2026-04-16
3 Gemini 3.1 Pro Preview 96.367% Gemini 3.1 Pro Preview
google-gemini-3.1-pro-preview
Imported 2026-04-16
4 GPT 5 2025-08-07 96.317% GPT-5
openai-gpt-5
Imported 2026-04-16
5 GPT 5.4 2026-03-05 96.092% GPT-5.4
openai-gpt-5.4
Imported 2026-04-16
6 GPT 5 Mini 2025-08-07 96.058% GPT-5 Mini
openai-gpt-5-mini
Imported 2026-04-16
7 O3 2025-04-16 96.058% o3
openai-o3
Imported 2026-04-16
8 Gemini 3 Pro Preview 96.033% Gemini 3
google-gemini-3
Imported 2026-04-16
9 O4 Mini 2025-04-16 96.017% o4 Mini
openai-o4-mini
Imported 2026-04-16
10 Claude Opus 4.5 20251101 Thinking 95.875% Claude Opus 4.5
anthropic-claude-opus-4.5
Imported 2026-04-16
11 Gemini 3 Flash Preview 95.808% Gemini 3 Flash Preview
google-gemini-3-flash-preview
Imported 2026-04-16
12 Claude Opus 4.6 Thinking 95.408% Claude Opus 4.6
anthropic-claude-opus-4.6
Imported 2026-04-16
13 Qwen 3.5 Plus Thinking 95.208% Imported 2026-04-16
14 O3 Mini 2025-01-31 94.833% o3-mini
openai-o3-mini
Imported 2026-04-16
15 Claude Sonnet 4.5 20250929 Thinking 94.708% Claude Sonnet 4.5
anthropic-claude-sonnet-4.5
Imported 2026-04-16
16 Grok 4.20 0309 Reasoning 94.55% GROK Grok 4.20
x-ai-grok-4.20
Imported 2026-04-16
17 Kimi K2.5 Thinking 94.367% KIMI MoonshotAI: Kimi K2.5
moonshotai-kimi-k2.5
Imported 2026-04-16
18 GLM 5 Thinking 94.267% GLM GLM 5
z-ai-glm-5
Imported 2026-04-16
19 GPT 5.2 2025-12-11 94.133% GPT-5.2
openai-gpt-5.2
Imported 2026-04-16
20 DeepSeek V3P2 Thinking 93.917% Imported 2026-04-16
21 GLM 4.7 93.742% GLM GLM 4.7
z-ai-glm-4.7
Imported 2026-04-16
22 Claude Opus 4.1 20250805 Thinking 93.592% Claude Opus 4.1
anthropic-claude-opus-4.1
Imported 2026-04-16
23 GPT 5 Nano 2025-08-07 93.258% GPT-5 Nano
openai-gpt-5-nano
Imported 2026-04-16
24 Claude Opus 4.5 20251101 93.158% Claude Opus 4.5
anthropic-claude-opus-4.5
Imported 2026-04-16
25 Gemini 2.5 Pro Exp 03 25 93.142% Imported 2026-04-16
26 O1 Preview 2024-09-12 93.008% Imported 2026-04-16
27 Claude Opus 4 20250514 92.867% Claude Opus 4
anthropic-claude-opus-4
Imported 2026-04-16
28 Claude Sonnet 4 20250514 Thinking 92.708% Imported 2026-04-16
29 Kimi K2 Thinking 92.592% KIMI MoonshotAI: Kimi K2 Thinking
moonshotai-kimi-k2-thinking
Imported 2026-04-16
30 Claude Opus 4.1 20250805 92.533% Claude Opus 4.1
anthropic-claude-opus-4.1
Imported 2026-04-16
31 MiniMax M2.5 Lightning 92.525% Imported 2026-04-16
32 Grok 4 0709 92.492% GROK Grok 4
x-ai-grok-4
Imported 2026-04-16
33 Grok 2 1212 92.317% Imported 2026-04-16
34 GLM 4.6 92.225% GLM GLM 4.6
z-ai-glm-4.6
Imported 2026-04-16
35 Grok 4.1 Fast Reasoning 92.083% GROK Grok 4.1 Fast
x-ai-grok-4.1-fast
Imported 2026-04-16
36 Grok 4 Fast Reasoning 92.067% GROK Grok 4 Fast
x-ai-grok-4-fast
Imported 2026-04-16
37 Claude Sonnet 4.6 92.058% Claude Sonnet 4.6
anthropic-claude-sonnet-4.6
Imported 2026-04-16
38 Gemini 2.5 Flash Preview 09 2025 91.433% Imported 2026-04-16
39 GPT Oss 120B 91.36% gpt-oss-120b
openai-gpt-oss-120b
Imported 2026-04-16
40 GPT 4.1 2025-04-14 91.183% GPT-4.1
openai-gpt-4.1
Imported 2026-04-16
41 Gemini 2.5 Flash Preview 09 2025 Thinking 91.167% Imported 2026-04-16
42 MiniMax M2.1 91.158% MiniMax M2.1
minimax-minimax-m2.1
Imported 2026-04-16
43 Gemini 2.5 Flash Preview 04 17 Thinking 91.017% Imported 2026-04-16
44 DeepSeek R1 90.8% R1
deepseek-r1
Imported 2026-04-16
45 Qwen 3 235B A22b 90.617% Qwen3 235B A22B
qwen-qwen3-235b-a22b
Imported 2026-04-16
46 Claude Sonnet 4 20250514 90.35% Claude Sonnet 4
anthropic-claude-sonnet-4
Imported 2026-04-16
47 Claude 3 7 Sonnet 20250219 Thinking 90.217% Imported 2026-04-16
48 O1 Mini 2024-09-12 90.217% Imported 2026-04-16
49 Grok 3 Mini Fast High Reasoning 90.1% Imported 2026-04-16
50 GLM 4.5 89.975% GLM GLM 4.5
z-ai-glm-4.5
Imported 2026-04-16
51 Magistral Medium 2509 89.467% Imported 2026-04-16
52 DeepSeek V3P2 89.45% Imported 2026-04-16
53 Gemini 2.5 Flash Lite Preview 09 2025 Thinking 88.867% Imported 2026-04-16
54 Grok 3 Mini Fast Low Reasoning 88.65% Imported 2026-04-16
55 Meta Llama 3.1 405B Instruct Turbo 88.242% Imported 2026-04-16
56 GPT 4O 2024-08-06 88.161% GPT-4o (2024-08-06)
openai-gpt-4o-2024-08-06
Imported 2026-04-16
57 Qwen 3 Max Preview 87.375% Imported 2026-04-16
58 Qwen 3 Max 87.367% Qwen3 Max
qwen-qwen3-max
Imported 2026-04-16
59 Gemini 2.5 Flash Preview 04 17 86.733% Imported 2026-04-16
60 Meta Llama 3.1 70B Instruct Turbo 84.784% Imported 2026-04-16
61 GPT 4.1 Mini 2025-04-14 84.633% GPT-4.1 Mini
openai-gpt-4.1-mini
Imported 2026-04-16
62 Kimi K2 Instruct 83.975% KIMI MoonshotAI: Kimi K2 0711
moonshotai-kimi-k2
Imported 2026-04-16
63 Grok 3 83.85% GROK Grok 3
x-ai-grok-3
Imported 2026-04-16
64 Claude 3 5 Sonnet 20241022 83.191% Claude 3.5 Sonnet
anthropic-claude-3.5-sonnet
Imported 2026-04-16
65 GPT Oss 20B 82.875% gpt-oss-20b
openai-gpt-oss-20b
Imported 2026-04-16
66 Magistral Small 2509 82.358% Imported 2026-04-16
67 Mistral Large 2512 82.233% Mistral: Mistral Large 3 2512
mistralai-mistral-large-2512
Imported 2026-04-16
68 DeepSeek V3 0324 82% DeepSeek V3 0324
deepseek-deepseek-chat-v3-0324
Imported 2026-04-16
69 GPT 4 Turbo 81.986% GPT-4 Turbo
openai-gpt-4-turbo
Imported 2026-04-16
70 Gemini 2.0 Flash 001 81.467% Gemini 2.0 Flash
google-gemini-2.0-flash
Imported 2026-04-16
71 DeepSeek V3 80.9% DeepSeek V3
deepseek-deepseek-chat
Imported 2026-04-16
72 Command A 03 2025 80.55% C Command A
cohere-command-a
Imported 2026-04-16
73 Gemini 2.5 Flash Lite Preview 09 2025 80.325% Gemini 2.5 Flash Lite Preview 09-2025
google-gemini-2.5-flash-lite-preview-09-2025
Imported 2026-04-16
74 Claude Haiku 4.5 20251001 Thinking 79.567% Claude Haiku 4.5
anthropic-claude-haiku-4.5
Imported 2026-04-16
75 Mistral Medium 2505 78.233% Imported 2026-04-16
76 Qwen 2.5 72B Instruct Turbo 77.395% Imported 2026-04-16
77 Gemini 1.5 Pro 002 76.53% Imported 2026-04-16
78 Mistral Large 2411 76.225% Mistral Large 2411
mistralai-mistral-large-2411
Imported 2026-04-16
79 Grok 4.1 Fast Non Reasoning 76.025% GROK Grok 4.1 Fast
x-ai-grok-4.1-fast
Imported 2026-04-16
80 Grok 4 Fast Non Reasoning 75.358% GROK Grok 4 Fast
x-ai-grok-4-fast
Imported 2026-04-16
81 GPT 4O Mini 2024-07-18 72.436% GPT-4o-mini (2024-07-18)
openai-gpt-4o-mini-2024-07-18
Imported 2026-04-16
82 Mistral Small 2503 69.1% Imported 2026-04-16
83 GPT 4.1 Nano 2025-04-14 68.225% GPT-4.1 Nano
openai-gpt-4.1-nano
Imported 2026-04-16
84 Jamba 1.5 Large 68.108% Imported 2026-04-16
85 Meta Llama 3.1 8B Instruct Turbo 62.614% Imported 2026-04-16
86 Mixtral 8x22B Instruct V0.1 62.139% Imported 2026-04-16
87 GPT 3.5 Turbo 58.471% GPT-3.5 Turbo
openai-gpt-3.5-turbo
Imported 2026-04-16
88 Mistral Small 2402 56.983% Imported 2026-04-16
89 Jamba 1.5 Mini 55.183% Imported 2026-04-16
90 Mixtral 8x7B V0.1 53.218% Imported 2026-04-16
91 Jamba Mini 1.6 52.517% Imported 2026-04-16
92 Llama 4 Scout 17B 16E Instruct 50.9% Llama 4 Scout
meta-llama-llama-4-scout
Imported 2026-04-16
93 Jamba Large 1.6 50.7% Imported 2026-04-16
94 Llama4 Maverick Instruct Basic 43.3% Imported 2026-04-16
95 Command R Plus 2.651% Imported 2026-04-16
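
The flattened rows above can be turned back into structured records with a short parser. This is a sketch that assumes the layout seen in this listing: a rank, a subject name, a percentage score, then optional trailing text (a model match or the import note), with model slugs and "Imported" lines on their own lines. The names `ROW_PATTERN`, `LeaderboardRow`, and `parse_row` are hypothetical and not part of any Vals.ai tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# rank, subject (lazy), score followed by "%", optional trailing text
ROW_PATTERN = re.compile(r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)%(?:\s+(.*))?$")


@dataclass
class LeaderboardRow:
    rank: int
    subject: str
    score: float   # overall accuracy, in percent
    trailing: str  # model match or import note, if present on the same line


def parse_row(line: str) -> Optional[LeaderboardRow]:
    """Parse one ranked line; return None for slug or 'Imported' lines."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    rank, subject, score, trailing = match.groups()
    return LeaderboardRow(int(rank), subject, float(score), trailing or "")


# Lines copied from the listing above.
sample = [
    "1 O1 2024-12-17 96.517% o1",
    "openai-o1",
    "Imported 2026-04-16",
    "13 Qwen 3.5 Plus Thinking 95.208% Imported 2026-04-16",
    "68 DeepSeek V3 0324 82% DeepSeek V3 0324",
]
rows = [row for row in map(parse_row, sample) if row is not None]
for row in rows:
    print(row.rank, row.subject, row.score)
```

The lazy subject group stops at the first number that is immediately followed by `%`, so version numbers and dates inside a subject name are not mistaken for the score.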