MultiMedQA

MultiMedQA: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.

5 rows
Primary metric: mean_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Mean accuracy across reported MultiMedQA components, MedQA Mainland China, MedQA Taiwan, MedQA United States (5-option), MedQA United States (4-option), PubMedQA Reasoning Required, MedMCQA Dev, MMLU Clinical Knowledge, MMLU Medical Genetics, MMLU Anatomy, MMLU Professional Medicine, MMLU College Biology, MMLU College Medicine

Latest Results

Rows are transcribed from Table 4 of the public GPT-4 medical challenge problems report. The primary score is a BenchmarkList-derived mean over the multiple-choice MultiMedQA component accuracies reported for each subject.
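The primary score can be sketched as a plain arithmetic mean over whichever component accuracies are reported for a subject. This is a minimal illustration, not BenchmarkList's actual code: the function name `mean_accuracy`, the rule that unreported components are skipped rather than counted as zero, and the placeholder component values are all assumptions.

```python
from typing import Mapping, Optional


def mean_accuracy(components: Mapping[str, Optional[float]]) -> float:
    """Average the reported component accuracies (in percent).

    Components with value None are treated as unreported and skipped.
    This skipping rule is an assumption, not documented behavior.
    """
    reported = [value for value in components.values() if value is not None]
    if not reported:
        raise ValueError("no reported components to average")
    return sum(reported) / len(reported)


# Hypothetical placeholder values, NOT real benchmark results.
example = {
    "component_a": 80.0,
    "component_b": 70.0,
    "component_c": None,  # unreported, so excluded from the mean
}

print(mean_accuracy(example))  # 75.0
```

Under this assumption, subjects with different numbers of reported components are averaged over different denominators, so their means are not strictly comparable.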

| Rank | Subject | Mean accuracy across reported MultiMedQA components | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 (5-shot) | 82.405833% | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-27 |
| 2 | GPT-4 (zero-shot) | 81.134167% | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 3 | Flan-PaLM 540B (few-shot) | 72.133333% | | Imported | 2026-05-27 |
| 4 | GPT-3.5 (5-shot) | 59.518333% | | Imported | 2026-05-27 |
| 5 | GPT-3.5 (zero-shot) | 58.9875% | | Imported | 2026-05-27 |