HELM MedQA

HELM MedQA: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.

Rows: 13
Primary metric: exact_match
Sampled: 2026-05-28

Metadata

Metrics

Exact match, Observed inference time (s) (lower is better), # eval

Latest Results

Rows are imported from the MedHELM public GCS MedQA group JSON. Exact match is listed as a fraction between 0 and 1; multiply by 100 to read it as a percentage.

Rank  Subject                         Exact match  Model Match                                      Provenance  Sampled
1     GPT-5 (2025-08-07)              0.968191     GPT-5 (openai-gpt-5)                             Imported    2026-05-28
2     GPT-5 mini (2025-08-07)         0.956262     GPT-5 Mini (openai-gpt-5-mini)                   Imported    2026-05-28
3     o4-mini (2025-04-16)            0.948310     o4 Mini (openai-o4-mini)                         Imported    2026-05-28
4     Gemini 2.5 Pro (05-06 preview)  0.934394     Gemini 2.5 Pro (google-gemini-2.5-pro)           Imported    2026-05-28
5     o3-mini (2025-01-31)            0.920477     o3-mini (openai-o3-mini)                         Imported    2026-05-28
6     GPT-4o (2024-05-13)             0.876740     GPT-4o (openai-gpt-4o)                           Imported    2026-05-28
7     Claude 3.5 Sonnet (20241022)    0.864811     Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)  Imported    2026-05-28
8     Claude 3.7 Sonnet (20250219)    0.856859     Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet)  Imported    2026-05-28
9     DeepSeek R1                     0.856859     R1 (deepseek-r1)                                 Imported    2026-05-28
10    Gemini 2.0 Flash                0.848907     Gemini 2.0 Flash (google-gemini-2.0-flash)       Imported    2026-05-28
11    Llama 3.3 Instruct (70B)        0.801193     -                                                Imported    2026-05-28
12    Gemini 1.5 Pro (001)            0.769384     -                                                Imported    2026-05-28
13    GPT-4o mini (2024-07-18)        0.749503     GPT-4o-mini (openai-gpt-4o-mini)                 Imported    2026-05-28
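
The exact-match scores listed above are fractions, so a reader who wants percentages has to scale them. The following is a minimal sketch of that conversion. The `rows` list and the `to_percent` helper are hypothetical names introduced for illustration; the values are copied by hand from the results above, and the layout does not reflect the schema of the MedHELM JSON.

```python
# Hypothetical sample: (subject, exact match as a fraction), copied from the results above.
rows = [
    ("GPT-5 (2025-08-07)", 0.968191),
    ("Claude 3.7 Sonnet (20250219)", 0.856859),
    ("DeepSeek R1", 0.856859),
    ("GPT-4o mini (2024-07-18)", 0.749503),
]


def to_percent(fraction: float, digits: int = 1) -> str:
    """Format a 0-1 exact-match fraction as a percentage string."""
    return f"{fraction * 100:.{digits}f}%"


if __name__ == "__main__":
    for subject, score in rows:
        # Left-align the subject so the percentages line up in a column.
        print(f"{subject:<32}{to_percent(score)}")
```

For example, `to_percent(0.968191)` gives `96.8%`, and passing `digits=2` keeps two decimal places.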