MEDIC Benchmark

Clinical LLM benchmark leaderboard spanning closed-ended medical QA, open-ended clinical tasks, medical safety, summarization, note generation, HealthBench, EHRSQL, MedCalc, MedEC, general-domain, and DischargeMe evaluations.

Rows: 96
Primary metric: medic_public_table_average
Sampled: 2026-05-27

Metadata

Metrics

MEDIC Public Table Average
clinical_response_overall: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_medicationqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_healthsearchqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_liveqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_medquad: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_mtsamples: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_misc: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
note_generation: Overall Score, Coverage, Conformity, Consistency, Conciseness
medical_summarization: Overall Score, Coverage, Conformity, Consistency
dischargeme: Overall Score, Coverage, Conformity, Consistency
healthbench_hard: Overall Score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth
healthbench_consensus_hard: Overall Score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth
med_safety: Harmfulness Score (lower is better), 95% CI (lower is better)
closed_ended_medical_qa: Average, MMLU, MMLU-Pro, MedMCQA, MedQA, USMLE, PubMedQA, ToxiGen
multilingual_medical_qa: Average, 🇦🇪 Arabic, 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇴 Romanian, 🇬🇷 Greek
ehrsql: RS (0), Abstains correct %, Abstains incorrect %, Abstains failed %
ehrsql_heldout: RS (0), Abstains correct %, Abstains incorrect %, Abstains failed %
medcalc: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medcalc_qwen32b: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medcalc_qwen14b: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medec_ms: Error Flag Accuracy (%), Error Sentence ID Accuracy (%), Invalid Responses (%) (lower is better)
medec_full: Error Flag Accuracy (%), Error Sentence ID Accuracy (%), Invalid Responses (%) (lower is better)
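
The metric labels above follow a `table: metric` pattern, with a trailing "(lower is better)" marking metrics whose direction is inverted. As a minimal sketch of how such a label could be split into its parts, the helper below uses hypothetical names (`MetricLabel`, `parse_metric_label`) that are not part of the benchmark itself.

```python
import re
from typing import NamedTuple


class MetricLabel(NamedTuple):
    table: str             # e.g. "med_safety"; empty when the label has no table prefix
    metric: str            # metric name with the direction suffix removed
    lower_is_better: bool  # True when the label ends with "(lower is better)"


# Matches the direction suffix only at the very end of the label.
_LOWER_SUFFIX = re.compile(r"\s*\(lower is better\)\s*$")


def parse_metric_label(label: str) -> MetricLabel:
    """Split a label such as 'med_safety: Harmfulness Score (lower is better)'."""
    label = label.strip()
    lower = bool(_LOWER_SUFFIX.search(label))
    label = _LOWER_SUFFIX.sub("", label)
    table, sep, metric = label.partition(": ")
    if not sep:
        # No "table: " prefix, e.g. "MEDIC Public Table Average".
        return MetricLabel("", label, lower)
    return MetricLabel(table, metric, lower)


print(parse_metric_label("medec_ms: Invalid Responses (%) (lower is better)"))
```

Only the final "(lower is better)" is stripped, so parenthesised units such as "(%)" or "(0)" stay part of the metric name.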

Latest Results

Rows are parsed from public DataFrame values embedded in the MEDIC Benchmark Hugging Face Space config. Composite scores are normalized averages for ranking only; table-level metrics remain available in each row.
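
The page does not say how the composite score is normalized. The sketch below shows one plausible scheme, purely as an assumption: min-max scale each metric to 0-100 across all rows, invert metrics marked lower-is-better, and average whatever metrics a row actually has. The function name, the metric names `acc` and `harm`, and the toy values are all hypothetical and are not leaderboard data.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[float]]


def normalized_average(rows: List[Row], lower_is_better: Set[str]) -> List[Optional[float]]:
    """Assumed scheme: per-metric min-max scaling to 0-100, then a per-row mean."""
    metrics = sorted({m for row in rows for m in row})

    # Observed (min, max) per metric, ignoring missing values.
    bounds = {}
    for m in metrics:
        values = [row[m] for row in rows if row.get(m) is not None]
        if values:
            bounds[m] = (min(values), max(values))

    results: List[Optional[float]] = []
    for row in rows:
        scaled = []
        for m, (lo, hi) in bounds.items():
            v = row.get(m)
            if v is None or hi == lo:
                continue  # skip missing values and metrics with no spread
            s = (v - lo) / (hi - lo)
            if m in lower_is_better:
                s = 1.0 - s  # flip so that higher is always better
            scaled.append(100.0 * s)
        results.append(round(sum(scaled) / len(scaled), 2) if scaled else None)
    return results


# Toy values for illustration only.
toy_rows = [
    {"acc": 50.0, "harm": 2.0},
    {"acc": 100.0, "harm": 1.0},
    {"acc": 75.0, "harm": None},
]
print(normalized_average(toy_rows, lower_is_better={"harm"}))  # [0.0, 100.0, 50.0]
```

Under this assumption a row with few reported metrics is averaged over fewer terms, which is one reason a composite of this kind is suited to ranking only and not to comparing absolute performance.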

Rank Subject MEDIC Public Table Average Model Match Provenance Sampled
1 google/gemini-2.5-flash-preview-04-17-thinking 92.36 average normalized public table score — Imported 2026-05-27
2 openai/gpt-4.1 91.71 average normalized public table score GPT-4.1 (openai-gpt-4.1) Imported 2026-05-27
3 openai/o4-mini 90.5 average normalized public table score o4 Mini (openai-o4-mini) Imported 2026-05-27
4 Qwen/Qwen2-72B-Instruct 85.87 average normalized public table score — Imported 2026-05-27
5 chengang12345/Qwen2.5-32B-Instruct-FineTune 85.75 average normalized public table score — Imported 2026-05-27
6 Qwen/QwQ-32B-Preview 85.56 average normalized public table score — Imported 2026-05-27
7 Qwen/Qwen2.5-72B 85.46 average normalized public table score — Imported 2026-05-27
8 Qwen/Qwen3-30B-A3B 84.38 average normalized public table score Qwen3 30B A3B (qwen-qwen3-30b-a3b) Imported 2026-05-27
9 oxyapi/oxy-1-small 84.08 average normalized public table score — Imported 2026-05-27
10 akjindal53244/Llama-3.1-Storm-8B 83.69 average normalized public table score — Imported 2026-05-27
11 princeton-nlp/gemma-2-9b-it-SimPO 81.55 average normalized public table score — Imported 2026-05-27
12 Qwen/Qwen2.5-7B 81.31 average normalized public table score — Imported 2026-05-27
13 tiiuae/Falcon3-10B-Instruct 81.03 average normalized public table score — Imported 2026-05-27
14 tiiuae/Falcon3-7B-Instruct 79.83 average normalized public table score — Imported 2026-05-27
15 Qwen/Qwen2.5-3B 78.11 average normalized public table score — Imported 2026-05-27
16 meta-llama/Llama-3.1-405B-Instruct 76.81 average normalized public table score — Imported 2026-05-27
17 newsbang/Homer-v1.0-Qwen2.5-7B 76.48 average normalized public table score — Imported 2026-05-27
18 tiiuae/Falcon3-3B-Instruct 75.28 average normalized public table score — Imported 2026-05-27
19 moonshotai/Kimi-K2-Thinking 74.36 average normalized public table score MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) Imported 2026-05-27
20 mistralai/Mistral-Large-3-675B-Instruct-2512 74.25 average normalized public table score — Imported 2026-05-27
21 silma-ai/SILMA-9B-Instruct-v1.0 73.25 average normalized public table score — Imported 2026-05-27
22 deepseek-ai/DeepSeek-V3.1 72.84 average normalized public table score — Imported 2026-05-27
23 baichuan-inc/Baichuan-M1-14B-Instruct 72.3 average normalized public table score — Imported 2026-05-27
24 HuggingFaceTB/SmolLM2-1.7B-Instruct 72.15 average normalized public table score — Imported 2026-05-27
25 tiiuae/falcon-11B 71.68 average normalized public table score — Imported 2026-05-27
26 tiiuae/Falcon3-1B-Instruct 69.39 average normalized public table score — Imported 2026-05-27
27 deepseek-ai/DeepSeek-R1-Distill-Llama-70B 68.84 average normalized public table score R1 Distill Llama 70B (deepseek-deepseek-r1-distill-llama-70b) Imported 2026-05-27
28 deepseek-ai/DeepSeek-V3 68.46 average normalized public table score — Imported 2026-05-27
29 meta-llama/Llama-3.3-70B-Instruct 68.4 average normalized public table score Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) Imported 2026-05-27
30 ProbeMedicalYonseiMAILab/medllama3-v20 67.81 average normalized public table score — Imported 2026-05-27
31 Qwen/Qwen2.5-72B-Instruct 66.82 average normalized public table score Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) Imported 2026-05-27
32 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 66.4 average normalized public table score R1 Distill Qwen 32B (deepseek-deepseek-r1-distill-qwen-32b) Imported 2026-05-27
33 Qwen/Qwen3-235B-A22B 66.02 average normalized public table score Qwen3 235B A22B (qwen-qwen3-235b-a22b) Imported 2026-05-27
34 meta-llama/Llama-3.1-70B 65.56 average normalized public table score — Imported 2026-05-27
35 openai/gpt-4.1-mini 65.49 average normalized public table score GPT-4.1 Mini (openai-gpt-4.1-mini) Imported 2026-05-27
36 Intelligent-Internet/II-Medical-8B 65.08 average normalized public table score — Imported 2026-05-27
37 Qwen/Qwen2.5-7B-Instruct 64.83 average normalized public table score Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) Imported 2026-05-27
38 google/medgemma-4b-it 64.2 average normalized public table score — Imported 2026-05-27
39 winninghealth/WiNGPT2-Gemma-2-9B-Chat 64.08 average normalized public table score — Imported 2026-05-27
40 meta-llama/Llama-4-Scout-17B-16E-Instruct 63.89 average normalized public table score Llama 4 Scout (meta-llama-llama-4-scout) Imported 2026-05-27
41 mistralai/Mistral-Large-Instruct-2407 63.67 average normalized public table score — Imported 2026-05-27
42 CohereForAI/aya-expanse-32b 63.37 average normalized public table score — Imported 2026-05-27
43 meta-llama/Llama-3.1-8B-Instruct 63.15 average normalized public table score Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) Imported 2026-05-27
44 meta-llama/Llama-3.1-70B-Instruct 63.03 average normalized public table score Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) Imported 2026-05-27
45 BiMediX/BiMediX-Bi 63 average normalized public table score — Imported 2026-05-27
46 m42-health/Llama3-Med42-8B 62.84 average normalized public table score — Imported 2026-05-27
47 FractalAIResearch/Ramanujan-Ganit-R1-14B 62.48 average normalized public table score — Imported 2026-05-27
48 HuggingFaceTB/SmolLM2-360M-Instruct 62.47 average normalized public table score — Imported 2026-05-27
49 meta-llama/Llama-4-Maverick-17B-128E-Instruct 62.27 average normalized public table score Llama 4 Maverick (meta-llama-4-maverick) Imported 2026-05-27
50 Qwen/Qwen2-0.5B 62.2 average normalized public table score — Imported 2026-05-27
51 nvidia/Llama-3.1-Nemotron-70B-Instruct-HF 61.53 average normalized public table score — Imported 2026-05-27
52 mistralai/Mistral-7B-Instruct-v0.3 61.51 average normalized public table score — Imported 2026-05-27
53 openai/gpt-oss-120b 61.39 average normalized public table score gpt-oss-120b (openai-gpt-oss-120b) Imported 2026-05-27
54 m42-health/Llama3-Med42-70B 61.34 average normalized public table score — Imported 2026-05-27
55 Qwen/Qwen2.5-3B-Instruct 61.19 average normalized public table score — Imported 2026-05-27
56 NousResearch/Hermes-3-Llama-3.1-8B 61.15 average normalized public table score — Imported 2026-05-27
57 deepseek-ai/DeepSeek-R1-Distill-Qwen-14B 61.14 average normalized public table score — Imported 2026-05-27
58 OpenMeditron/Meditron3-70B 61.1 average normalized public table score — Imported 2026-05-27
59 microsoft/MediPhi-Instruct 60.92 average normalized public table score — Imported 2026-05-27
60 microsoft/Phi-3.5-mini-instruct 60.86 average normalized public table score — Imported 2026-05-27
61 meta-llama/Meta-Llama-3-70B-Instruct 60.75 average normalized public table score Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) Imported 2026-05-27
62 Qwen/Qwen3-4B 60.18 average normalized public table score — Imported 2026-05-27
63 deepseek-ai/DeepSeek-R1-Distill-Llama-8B 60.14 average normalized public table score — Imported 2026-05-27
64 microsoft/phi-4 60.11 average normalized public table score Phi 4 (microsoft-phi-4) Imported 2026-05-27
65 upstage/SOLAR-10.7B-Instruct-v1.0 59.38 average normalized public table score — Imported 2026-05-27
66 google/gemma-3-27b-it 59.16 average normalized public table score Gemma 3 27B (google-gemma-3-27b-it) Imported 2026-05-27
67 NousResearch/Hermes-2-Pro-Llama-3-8B 58.83 average normalized public table score Hermes 2 Pro - Llama-3 8B (nousresearch-hermes-2-pro-llama-3-8b) Imported 2026-05-27
68 BioMistral/BioMistral-7B 58.59 average normalized public table score — Imported 2026-05-27
69 neulab/Pangea-7B 56.69 average normalized public table score — Imported 2026-05-27
70 aaditya/Llama3-OpenBioLLM-70B 55.77 average normalized public table score — Imported 2026-05-27
71 Qwen/Qwen3-32B 55.71 average normalized public table score Qwen3 32B (qwen-qwen3-32b) Imported 2026-05-27
72 Qwen/Qwen3-1.7B 55.59 average normalized public table score — Imported 2026-05-27
73 openai/gpt-oss-20b 55.49 average normalized public table score gpt-oss-20b (openai-gpt-oss-20b) Imported 2026-05-27
74 Qwen/Qwen3-8B 55.41 average normalized public table score Qwen3 8B (qwen-qwen3-8b) Imported 2026-05-27
75 tiiuae/falcon-mamba-7b-instruct 55.01 average normalized public table score — Imported 2026-05-27
76 deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 54.49 average normalized public table score — Imported 2026-05-27
77 winninghealth/WiNGPT2-Llama-3-8B-Chat 53.72 average normalized public table score — Imported 2026-05-27
78 tiiuae/falcon-mamba-7b 53.43 average normalized public table score — Imported 2026-05-27
79 meta-llama/Llama-3.2-1B-Instruct 52.78 average normalized public table score Llama 3.2 1B Instruct (meta-llama-llama-3.2-1b-instruct) Imported 2026-05-27
80 tiiuae/Falcon3-Mamba-7B-Instruct 52.61 average normalized public table score — Imported 2026-05-27
81 Qwen/Qwen3-14B 52.09 average normalized public table score Qwen3 14B (qwen-qwen3-14b) Imported 2026-05-27
82 meta-llama/Llama-3.2-3B-Instruct 51.44 average normalized public table score Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct) Imported 2026-05-27
83 01-ai/Yi-1.5-6B-Chat 50.45 average normalized public table score — Imported 2026-05-27
84 meta-llama/Llama-3.1-8B 50.06 average normalized public table score — Imported 2026-05-27
85 ministral/Ministral-3b-instruct 48.94 average normalized public table score — Imported 2026-05-27
86 Qwen/Qwen3-0.6B 48.85 average normalized public table score — Imported 2026-05-27
87 Qwen/Qwen3-30B-A3B-Thinking-2507 47.5 average normalized public table score Qwen3 30B A3B Thinking 2507 (qwen-qwen3-30b-a3b-thinking-2507) Imported 2026-05-27
88 deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B 46.73 average normalized public table score — Imported 2026-05-27
89 Qwen/Qwen2.5-0.5B-Instruct 46.08 average normalized public table score — Imported 2026-05-27
90 google/medgemma-27b-text-it 45 average normalized public table score — Imported 2026-05-27
91 Qwen/Qwen3-4B-Thinking-2507 43 average normalized public table score — Imported 2026-05-27
92 deepseek-ai/DeepSeek-R1 35.5 average normalized public table score R1 (deepseek-r1) Imported 2026-05-27
93 microsoft/MediPhi 32.03 average normalized public table score — Imported 2026-05-27
94 HuggingFaceTB/SmolLM3-3B 21.73 average normalized public table score — Imported 2026-05-27
95 openai/gpt-4o-mini-2024-07-18 19 average normalized public table score GPT-4o-mini (openai-gpt-4o-mini) Imported 2026-05-27
96 Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1 8.5 average normalized public table score — Imported 2026-05-27
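
For readers who want to work with the rows above programmatically, here is a minimal sketch that pulls rank, subject, and composite score out of the plain-text listing. It assumes each row starts with `rank subject score average normalized public table score`, as in this listing, and ignores any other line. The names `LeaderboardRow` and `parse_rows` are hypothetical.

```python
import re
from typing import List, NamedTuple


class LeaderboardRow(NamedTuple):
    rank: int
    subject: str    # model identifier, e.g. "openai/gpt-4.1"
    average: float  # MEDIC Public Table Average


# A row begins with: rank, subject (no spaces), score, then the fixed score caption.
_ROW = re.compile(
    r"^(\d+) (\S+) (\d+(?:\.\d+)?) average normalized public table score\b"
)


def parse_rows(text: str) -> List[LeaderboardRow]:
    """Return one LeaderboardRow per line that matches the row pattern."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            rank, subject, average = match.groups()
            rows.append(LeaderboardRow(int(rank), subject, float(average)))
    return rows


sample = (
    "2 openai/gpt-4.1 91.71 average normalized public table score GPT-4.1\n"
    "openai-gpt-4.1\n"
    "Imported 2026-05-27\n"
)
print(parse_rows(sample))
```

Because continuation lines such as a model slug or an import date never match the row pattern, the sketch tolerates rows that wrap across several lines.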