MEDIC Benchmark
Clinical LLM benchmark leaderboard spanning closed-ended medical QA, open-ended clinical tasks, medical safety, summarization, note generation, HealthBench, EHRSQL, MedCalc, MedEC, general-domain, and DischargeMe evaluations.
Metadata
Metrics
MEDIC Public Table Average,
clinical_response_overall: ELO, clinical_response_overall: ELO 95% CI (lower is better), clinical_response_overall: Score, clinical_response_overall: Score 95% CI (lower is better),
clinical_response_medicationqa: ELO, clinical_response_medicationqa: ELO 95% CI (lower is better), clinical_response_medicationqa: Score, clinical_response_medicationqa: Score 95% CI (lower is better),
clinical_response_healthsearchqa: ELO, clinical_response_healthsearchqa: ELO 95% CI (lower is better), clinical_response_healthsearchqa: Score, clinical_response_healthsearchqa: Score 95% CI (lower is better),
clinical_response_liveqa: ELO, clinical_response_liveqa: ELO 95% CI (lower is better), clinical_response_liveqa: Score, clinical_response_liveqa: Score 95% CI (lower is better),
clinical_response_medquad: ELO, clinical_response_medquad: ELO 95% CI (lower is better), clinical_response_medquad: Score, clinical_response_medquad: Score 95% CI (lower is better),
clinical_response_mtsamples: ELO, clinical_response_mtsamples: ELO 95% CI (lower is better), clinical_response_mtsamples: Score, clinical_response_mtsamples: Score 95% CI (lower is better),
clinical_response_misc: ELO, clinical_response_misc: ELO 95% CI (lower is better), clinical_response_misc: Score, clinical_response_misc: Score 95% CI (lower is better),
note_generation: Overall Score, note_generation: Coverage, note_generation: Conformity, note_generation: Consistency, note_generation: Conciseness,
medical_summarization: Overall Score, medical_summarization: Coverage, medical_summarization: Conformity, medical_summarization: Consistency,
dischargeme: Overall Score, dischargeme: Coverage, dischargeme: Conformity, dischargeme: Consistency,
healthbench_hard: Overall Score, healthbench_hard: Responding under uncertainty, healthbench_hard: Health data tasks, healthbench_hard: Global health, healthbench_hard: Expertise-tailored communication, healthbench_hard: Context seeking, healthbench_hard: Emergency referrals, healthbench_hard: Response depth,
healthbench_consensus_hard: Overall Score, healthbench_consensus_hard: Responding under uncertainty, healthbench_consensus_hard: Health data tasks, healthbench_consensus_hard: Global health, healthbench_consensus_hard: Expertise-tailored communication, healthbench_consensus_hard: Context seeking, healthbench_consensus_hard: Emergency referrals, healthbench_consensus_hard: Response depth,
med_safety: Harmfulness Score (lower is better), med_safety: 95% CI (lower is better),
closed_ended_medical_qa: Average, closed_ended_medical_qa: MMLU, closed_ended_medical_qa: MMLU-Pro, closed_ended_medical_qa: MedMCQA, closed_ended_medical_qa: MedQA, closed_ended_medical_qa: USMLE, closed_ended_medical_qa: PubMedQA, closed_ended_medical_qa: ToxiGen,
multilingual_medical_qa: Average, multilingual_medical_qa: 🇦🇪 Arabic, multilingual_medical_qa: 🇫🇷 French, multilingual_medical_qa: 🇪🇸 Spanish, multilingual_medical_qa: 🇵🇹 Portuguese, multilingual_medical_qa: 🇷🇴 Romanian, multilingual_medical_qa: 🇬🇷 Greek,
ehrsql: RS (0), ehrsql: Abstains correct %, ehrsql: Abstains incorrect %, ehrsql: Abstains failed %,
ehrsql_heldout: RS (0), ehrsql_heldout: Abstains correct %, ehrsql_heldout: Abstains incorrect %, ehrsql_heldout: Abstains failed %,
medcalc: Lab, medcalc: Risk, medcalc: Physical, medcalc: Severity, medcalc: Diagnosis, medcalc: Date, medcalc: Dosage, medcalc: Overall,
medcalc_qwen32b: Lab, medcalc_qwen32b: Risk, medcalc_qwen32b: Physical, medcalc_qwen32b: Severity, medcalc_qwen32b: Diagnosis, medcalc_qwen32b: Date, medcalc_qwen32b: Dosage, medcalc_qwen32b: Overall,
medcalc_qwen14b: Lab, medcalc_qwen14b: Risk, medcalc_qwen14b: Physical, medcalc_qwen14b: Severity, medcalc_qwen14b: Diagnosis, medcalc_qwen14b: Date, medcalc_qwen14b: Dosage, medcalc_qwen14b: Overall,
medec_ms: Error Flag Accuracy (%), medec_ms: Error Sentence ID Accuracy (%), medec_ms: Invalid Responses (%) (lower is better),
medec_full: Error Flag Accuracy (%), medec_full: Error Sentence ID Accuracy (%), medec_full: Invalid Responses (%) (lower is better)
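
The metric names mix higher-is-better scores with several marked "(lower is better)", while the table reports a single "average normalized public table score" per model. This page does not say how that normalization is done. The sketch below shows one plausible scheme only, and everything in it is an assumption: min-max scaling of each metric across models to 0..100, flipping lower-is-better metrics, and averaging whatever metrics a model has. The function names and the example values are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean


def normalize(values, lower_is_better=False):
    """Min-max scale one metric across models to 0..100 (assumed scheme).

    `values` maps model name -> raw value, or None when the model has no
    result for this metric. Lower-is-better metrics are flipped so that
    100 is always the best observed value.
    """
    present = [v for v in values.values() if v is not None]
    if not present:
        return {model: None for model in values}
    lo, hi = min(present), max(present)
    out = {}
    for model, v in values.items():
        if v is None:
            out[model] = None
        elif hi == lo:
            out[model] = 100.0  # every model ties on this metric
        else:
            scaled = (v - lo) / (hi - lo) * 100.0
            out[model] = 100.0 - scaled if lower_is_better else scaled
    return out


def table_average(metrics):
    """Average each model's normalized scores over the metrics it has.

    `metrics` maps metric name -> (lower_is_better, {model: value or None}).
    """
    per_model = {}
    for lower_is_better, values in metrics.values():
        for model, score in normalize(values, lower_is_better).items():
            if score is not None:
                per_model.setdefault(model, []).append(score)
    return {model: round(mean(scores), 2) for model, scores in per_model.items()}


# Hypothetical values for illustration only.
example = {
    "closed_ended_medical_qa: Average": (False, {"model-a": 80.0, "model-b": 60.0}),
    "med_safety: Harmfulness Score": (True, {"model-a": 1.5, "model-b": 1.0}),
}
averages = table_average(example)
```

Under this assumed scheme a model evaluated on only a few metrics is averaged over just those metrics, so averages for models with different coverage are not directly comparable.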
| Rank | Subject | MEDIC Public Table Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-preview-04-17-thinking | 92.36 average normalized public table score | — | Imported | 2026-05-27 |
| 2 | openai/gpt-4.1 | 91.71 average normalized public table score | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-27 |
| 3 | openai/o4-mini | 90.5 average normalized public table score | o4 Mini openai-o4-mini | Imported | 2026-05-27 |
| 4 | Qwen/Qwen2-72B-Instruct | 85.87 average normalized public table score | — | Imported | 2026-05-27 |
| 5 | chengang12345/Qwen2.5-32B-Instruct-FineTune | 85.75 average normalized public table score | — | Imported | 2026-05-27 |
| 6 | Qwen/QwQ-32B-Preview | 85.56 average normalized public table score | — | Imported | 2026-05-27 |
| 7 | Qwen/Qwen2.5-72B | 85.46 average normalized public table score | — | Imported | 2026-05-27 |
| 8 | Qwen/Qwen3-30B-A3B | 84.38 average normalized public table score | Qwen3 30B A3B qwen-qwen3-30b-a3b | Imported | 2026-05-27 |
| 9 | oxyapi/oxy-1-small | 84.08 average normalized public table score | — | Imported | 2026-05-27 |
| 10 | akjindal53244/Llama-3.1-Storm-8B | 83.69 average normalized public table score | — | Imported | 2026-05-27 |
| 11 | princeton-nlp/gemma-2-9b-it-SimPO | 81.55 average normalized public table score | — | Imported | 2026-05-27 |
| 12 | Qwen/Qwen2.5-7B | 81.31 average normalized public table score | — | Imported | 2026-05-27 |
| 13 | tiiuae/Falcon3-10B-Instruct | 81.03 average normalized public table score | — | Imported | 2026-05-27 |
| 14 | tiiuae/Falcon3-7B-Instruct | 79.83 average normalized public table score | — | Imported | 2026-05-27 |
| 15 | Qwen/Qwen2.5-3B | 78.11 average normalized public table score | — | Imported | 2026-05-27 |
| 16 | meta-llama/Llama-3.1-405B-Instruct | 76.81 average normalized public table score | — | Imported | 2026-05-27 |
| 17 | newsbang/Homer-v1.0-Qwen2.5-7B | 76.48 average normalized public table score | — | Imported | 2026-05-27 |
| 18 | tiiuae/Falcon3-3B-Instruct | 75.28 average normalized public table score | — | Imported | 2026-05-27 |
| 19 | moonshotai/Kimi-K2-Thinking | 74.36 average normalized public table score | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2026-05-27 |
| 20 | mistralai/Mistral-Large-3-675B-Instruct-2512 | 74.25 average normalized public table score | — | Imported | 2026-05-27 |
| 21 | silma-ai/SILMA-9B-Instruct-v1.0 | 73.25 average normalized public table score | — | Imported | 2026-05-27 |
| 22 | deepseek-ai/DeepSeek-V3.1 | 72.84 average normalized public table score | — | Imported | 2026-05-27 |
| 23 | baichuan-inc/Baichuan-M1-14B-Instruct | 72.3 average normalized public table score | — | Imported | 2026-05-27 |
| 24 | HuggingFaceTB/SmolLM2-1.7B-Instruct | 72.15 average normalized public table score | — | Imported | 2026-05-27 |
| 25 | tiiuae/falcon-11B | 71.68 average normalized public table score | — | Imported | 2026-05-27 |
| 26 | tiiuae/Falcon3-1B-Instruct | 69.39 average normalized public table score | — | Imported | 2026-05-27 |
| 27 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 68.84 average normalized public table score | R1 Distill Llama 70B deepseek-deepseek-r1-distill-llama-70b | Imported | 2026-05-27 |
| 28 | deepseek-ai/DeepSeek-V3 | 68.46 average normalized public table score | — | Imported | 2026-05-27 |
| 29 | meta-llama/Llama-3.3-70B-Instruct | 68.4 average normalized public table score | Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct | Imported | 2026-05-27 |
| 30 | ProbeMedicalYonseiMAILab/medllama3-v20 | 67.81 average normalized public table score | — | Imported | 2026-05-27 |
| 31 | Qwen/Qwen2.5-72B-Instruct | 66.82 average normalized public table score | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Imported | 2026-05-27 |
| 32 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 66.4 average normalized public table score | R1 Distill Qwen 32B deepseek-deepseek-r1-distill-qwen-32b | Imported | 2026-05-27 |
| 33 | Qwen/Qwen3-235B-A22B | 66.02 average normalized public table score | Qwen3 235B A22B qwen-qwen3-235b-a22b | Imported | 2026-05-27 |
| 34 | meta-llama/Llama-3.1-70B | 65.56 average normalized public table score | — | Imported | 2026-05-27 |
| 35 | openai/gpt-4.1-mini | 65.49 average normalized public table score | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-27 |
| 36 | Intelligent-Internet/II-Medical-8B | 65.08 average normalized public table score | — | Imported | 2026-05-27 |
| 37 | Qwen/Qwen2.5-7B-Instruct | 64.83 average normalized public table score | Qwen2.5 7B Instruct qwen-qwen-2.5-7b-instruct | Imported | 2026-05-27 |
| 38 | google/medgemma-4b-it | 64.2 average normalized public table score | — | Imported | 2026-05-27 |
| 39 | winninghealth/WiNGPT2-Gemma-2-9B-Chat | 64.08 average normalized public table score | — | Imported | 2026-05-27 |
| 40 | meta-llama/Llama-4-Scout-17B-16E-Instruct | 63.89 average normalized public table score | Llama 4 Scout meta-llama-llama-4-scout | Imported | 2026-05-27 |
| 41 | mistralai/Mistral-Large-Instruct-2407 | 63.67 average normalized public table score | — | Imported | 2026-05-27 |
| 42 | CohereForAI/aya-expanse-32b | 63.37 average normalized public table score | — | Imported | 2026-05-27 |
| 43 | meta-llama/Llama-3.1-8B-Instruct | 63.15 average normalized public table score | Llama 3.1 8B Instruct meta-llama-llama-3.1-8b-instruct | Imported | 2026-05-27 |
| 44 | meta-llama/Llama-3.1-70B-Instruct | 63.03 average normalized public table score | Llama 3.1 70B Instruct meta-llama-llama-3.1-70b-instruct | Imported | 2026-05-27 |
| 45 | BiMediX/BiMediX-Bi | 63 average normalized public table score | — | Imported | 2026-05-27 |
| 46 | m42-health/Llama3-Med42-8B | 62.84 average normalized public table score | — | Imported | 2026-05-27 |
| 47 | FractalAIResearch/Ramanujan-Ganit-R1-14B | 62.48 average normalized public table score | — | Imported | 2026-05-27 |
| 48 | HuggingFaceTB/SmolLM2-360M-Instruct | 62.47 average normalized public table score | — | Imported | 2026-05-27 |
| 49 | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 62.27 average normalized public table score | Llama 4 Maverick meta-llama-4-maverick | Imported | 2026-05-27 |
| 50 | Qwen/Qwen2-0.5B | 62.2 average normalized public table score | — | Imported | 2026-05-27 |
| 51 | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 61.53 average normalized public table score | — | Imported | 2026-05-27 |
| 52 | mistralai/Mistral-7B-Instruct-v0.3 | 61.51 average normalized public table score | — | Imported | 2026-05-27 |
| 53 | openai/gpt-oss-120b | 61.39 average normalized public table score | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-27 |
| 54 | m42-health/Llama3-Med42-70B | 61.34 average normalized public table score | — | Imported | 2026-05-27 |
| 55 | Qwen/Qwen2.5-3B-Instruct | 61.19 average normalized public table score | — | Imported | 2026-05-27 |
| 56 | NousResearch/Hermes-3-Llama-3.1-8B | 61.15 average normalized public table score | — | Imported | 2026-05-27 |
| 57 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 61.14 average normalized public table score | — | Imported | 2026-05-27 |
| 58 | OpenMeditron/Meditron3-70B | 61.1 average normalized public table score | — | Imported | 2026-05-27 |
| 59 | microsoft/MediPhi-Instruct | 60.92 average normalized public table score | — | Imported | 2026-05-27 |
| 60 | microsoft/Phi-3.5-mini-instruct | 60.86 average normalized public table score | — | Imported | 2026-05-27 |
| 61 | meta-llama/Meta-Llama-3-70B-Instruct | 60.75 average normalized public table score | Llama 3 70B Instruct meta-llama-llama-3-70b-instruct | Imported | 2026-05-27 |
| 62 | Qwen/Qwen3-4B | 60.18 average normalized public table score | — | Imported | 2026-05-27 |
| 63 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 60.14 average normalized public table score | — | Imported | 2026-05-27 |
| 64 | microsoft/phi-4 | 60.11 average normalized public table score | Phi 4 microsoft-phi-4 | Imported | 2026-05-27 |
| 65 | upstage/SOLAR-10.7B-Instruct-v1.0 | 59.38 average normalized public table score | — | Imported | 2026-05-27 |
| 66 | google/gemma-3-27b-it | 59.16 average normalized public table score | Gemma 3 27B google-gemma-3-27b-it | Imported | 2026-05-27 |
| 67 | NousResearch/Hermes-2-Pro-Llama-3-8B | 58.83 average normalized public table score | Hermes 2 Pro - Llama-3 8B nousresearch-hermes-2-pro-llama-3-8b | Imported | 2026-05-27 |
| 68 | BioMistral/BioMistral-7B | 58.59 average normalized public table score | — | Imported | 2026-05-27 |
| 69 | neulab/Pangea-7B | 56.69 average normalized public table score | — | Imported | 2026-05-27 |
| 70 | aaditya/Llama3-OpenBioLLM-70B | 55.77 average normalized public table score | — | Imported | 2026-05-27 |
| 71 | Qwen/Qwen3-32B | 55.71 average normalized public table score | Qwen3 32B qwen-qwen3-32b | Imported | 2026-05-27 |
| 72 | Qwen/Qwen3-1.7B | 55.59 average normalized public table score | — | Imported | 2026-05-27 |
| 73 | openai/gpt-oss-20b | 55.49 average normalized public table score | gpt-oss-20b openai-gpt-oss-20b | Imported | 2026-05-27 |
| 74 | Qwen/Qwen3-8B | 55.41 average normalized public table score | Qwen3 8B qwen-qwen3-8b | Imported | 2026-05-27 |
| 75 | tiiuae/falcon-mamba-7b-instruct | 55.01 average normalized public table score | — | Imported | 2026-05-27 |
| 76 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 54.49 average normalized public table score | — | Imported | 2026-05-27 |
| 77 | winninghealth/WiNGPT2-Llama-3-8B-Chat | 53.72 average normalized public table score | — | Imported | 2026-05-27 |
| 78 | tiiuae/falcon-mamba-7b | 53.43 average normalized public table score | — | Imported | 2026-05-27 |
| 79 | meta-llama/Llama-3.2-1B-Instruct | 52.78 average normalized public table score | Llama 3.2 1B Instruct meta-llama-llama-3.2-1b-instruct | Imported | 2026-05-27 |
| 80 | tiiuae/Falcon3-Mamba-7B-Instruct | 52.61 average normalized public table score | — | Imported | 2026-05-27 |
| 81 | Qwen/Qwen3-14B | 52.09 average normalized public table score | Qwen3 14B qwen-qwen3-14b | Imported | 2026-05-27 |
| 82 | meta-llama/Llama-3.2-3B-Instruct | 51.44 average normalized public table score | Llama 3.2 3B Instruct meta-llama-llama-3.2-3b-instruct | Imported | 2026-05-27 |
| 83 | 01-ai/Yi-1.5-6B-Chat | 50.45 average normalized public table score | — | Imported | 2026-05-27 |
| 84 | meta-llama/Llama-3.1-8B | 50.06 average normalized public table score | — | Imported | 2026-05-27 |
| 85 | ministral/Ministral-3b-instruct | 48.94 average normalized public table score | — | Imported | 2026-05-27 |
| 86 | Qwen/Qwen3-0.6B | 48.85 average normalized public table score | — | Imported | 2026-05-27 |
| 87 | Qwen/Qwen3-30B-A3B-Thinking-2507 | 47.5 average normalized public table score | Qwen3 30B A3B Thinking 2507 qwen-qwen3-30b-a3b-thinking-2507 | Imported | 2026-05-27 |
| 88 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 46.73 average normalized public table score | — | Imported | 2026-05-27 |
| 89 | Qwen/Qwen2.5-0.5B-Instruct | 46.08 average normalized public table score | — | Imported | 2026-05-27 |
| 90 | google/medgemma-27b-text-it | 45 average normalized public table score | — | Imported | 2026-05-27 |
| 91 | Qwen/Qwen3-4B-Thinking-2507 | 43 average normalized public table score | — | Imported | 2026-05-27 |
| 92 | deepseek-ai/DeepSeek-R1 | 35.5 average normalized public table score | R1 deepseek-r1 | Imported | 2026-05-27 |
| 93 | microsoft/MediPhi | 32.03 average normalized public table score | — | Imported | 2026-05-27 |
| 94 | HuggingFaceTB/SmolLM3-3B | 21.73 average normalized public table score | — | Imported | 2026-05-27 |
| 95 | openai/gpt-4o-mini-2024-07-18 | 19 average normalized public table score | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-27 |
| 96 | Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1 | 8.5 average normalized public table score | — | Imported | 2026-05-27 |
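
Each row of the table above puts the numeric score and its unit label in the same cell, and uses a dash placeholder when there is no model match. For readers who want to work with the rows programmatically, here is a minimal parsing sketch. The `Row` type and `parse_row` helper are hypothetical names introduced for illustration, and the sample lines are copied from the table.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    subject: str
    score: float
    model_match: Optional[str]
    provenance: str
    sampled: str


def parse_row(line: str) -> Optional[Row]:
    """Parse one pipe-delimited table row; return None for header or separator lines."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 6 or not cells[0].isdigit():
        return None
    # The score cell looks like "91.71 average normalized public table score".
    score_match = re.match(r"-?\d+(?:\.\d+)?", cells[2])
    if score_match is None:
        return None
    # "\u2014" is the dash placeholder the table uses for "no model match".
    model_match = None if cells[3] == "\u2014" else cells[3]
    return Row(int(cells[0]), cells[1], float(score_match.group()),
               model_match, cells[4], cells[5])


sample = [
    "| Rank | Subject | MEDIC Public Table Average | Model Match | Provenance | Sampled |",
    "|---|---|---|---|---|---|",
    "| 2 | openai/gpt-4.1 | 91.71 average normalized public table score | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-27 |",
    "| 4 | Qwen/Qwen2-72B-Instruct | 85.87 average normalized public table score | \u2014 | Imported | 2026-05-27 |",
]
rows = [r for r in map(parse_row, sample) if r is not None]
```

The sketch assumes no cell contains a literal `|` character, which holds for the rows shown here.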