MEDIC Benchmark | BenchmarkList

Metadata

ID: medic_benchmark
Category: Medical
Release: Unknown

Metrics

MEDIC Public Table Average
clinical_response_overall: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_medicationqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_healthsearchqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_liveqa: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_medquad: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_mtsamples: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
clinical_response_misc: ELO, ELO 95% CI (lower is better), Score, Score 95% CI (lower is better)
note_generation: Overall Score, Coverage, Conformity, Consistency, Conciseness
medical_summarization: Overall Score, Coverage, Conformity, Consistency
dischargeme: Overall Score, Coverage, Conformity, Consistency
healthbench_hard: Overall Score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth
healthbench_consensus_hard: Overall Score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth
med_safety: Harmfulness Score (lower is better), 95% CI (lower is better)
closed_ended_medical_qa: Average, MMLU, MMLU-Pro, MedMCQA, MedQA, USMLE, PubMedQA, ToxiGen
multilingual_medical_qa: Average, 🇦🇪 Arabic, 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇴 Romanian, 🇬🇷 Greek
ehrsql: RS (0), Abstains correct %, Abstains incorrect %, Abstains failed %
ehrsql_heldout: RS (0), Abstains correct %, Abstains incorrect %, Abstains failed %
medcalc: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medcalc_qwen32b: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medcalc_qwen14b: Lab, Risk, Physical, Severity, Diagnosis, Date, Dosage, Overall
medec_ms: Error Flag Accuracy (%), Error Sentence ID Accuracy (%), Invalid Responses (%) (lower is better)
medec_full: Error Flag Accuracy (%), Error Sentence ID Accuracy (%), Invalid Responses (%) (lower is better)

Rank	Subject	MEDIC Public Table Average	Model Match	Provenance	Sampled
1	google/gemini-2.5-flash-preview-04-17-thinking	92.36 average normalized public table score	—	Imported	2026-05-27
2	openai/gpt-4.1	91.71 average normalized public table score	GPT-4.1 openai-gpt-4.1	Imported	2026-05-27
3	openai/o4-mini	90.5 average normalized public table score	o4 Mini openai-o4-mini	Imported	2026-05-27
4	Qwen/Qwen2-72B-Instruct	85.87 average normalized public table score	—	Imported	2026-05-27
5	chengang12345/Qwen2.5-32B-Instruct-FineTune	85.75 average normalized public table score	—	Imported	2026-05-27
6	Qwen/QwQ-32B-Preview	85.56 average normalized public table score	—	Imported	2026-05-27
7	Qwen/Qwen2.5-72B	85.46 average normalized public table score	—	Imported	2026-05-27
8	Qwen/Qwen3-30B-A3B	84.38 average normalized public table score	Qwen3 30B A3B qwen-qwen3-30b-a3b	Imported	2026-05-27
9	oxyapi/oxy-1-small	84.08 average normalized public table score	—	Imported	2026-05-27
10	akjindal53244/Llama-3.1-Storm-8B	83.69 average normalized public table score	—	Imported	2026-05-27
11	princeton-nlp/gemma-2-9b-it-SimPO	81.55 average normalized public table score	—	Imported	2026-05-27
12	Qwen/Qwen2.5-7B	81.31 average normalized public table score	—	Imported	2026-05-27
13	tiiuae/Falcon3-10B-Instruct	81.03 average normalized public table score	—	Imported	2026-05-27
14	tiiuae/Falcon3-7B-Instruct	79.83 average normalized public table score	—	Imported	2026-05-27
15	Qwen/Qwen2.5-3B	78.11 average normalized public table score	—	Imported	2026-05-27
16	meta-llama/Llama-3.1-405B-Instruct	76.81 average normalized public table score	—	Imported	2026-05-27
17	newsbang/Homer-v1.0-Qwen2.5-7B	76.48 average normalized public table score	—	Imported	2026-05-27
18	tiiuae/Falcon3-3B-Instruct	75.28 average normalized public table score	—	Imported	2026-05-27
19	moonshotai/Kimi-K2-Thinking	74.36 average normalized public table score	MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking	Imported	2026-05-27
20	mistralai/Mistral-Large-3-675B-Instruct-2512	74.25 average normalized public table score	—	Imported	2026-05-27
21	silma-ai/SILMA-9B-Instruct-v1.0	73.25 average normalized public table score	—	Imported	2026-05-27
22	deepseek-ai/DeepSeek-V3.1	72.84 average normalized public table score	—	Imported	2026-05-27
23	baichuan-inc/Baichuan-M1-14B-Instruct	72.3 average normalized public table score	—	Imported	2026-05-27
24	HuggingFaceTB/SmolLM2-1.7B-Instruct	72.15 average normalized public table score	—	Imported	2026-05-27
25	tiiuae/falcon-11B	71.68 average normalized public table score	—	Imported	2026-05-27
26	tiiuae/Falcon3-1B-Instruct	69.39 average normalized public table score	—	Imported	2026-05-27
27	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	68.84 average normalized public table score	R1 Distill Llama 70B deepseek-deepseek-r1-distill-llama-70b	Imported	2026-05-27
28	deepseek-ai/DeepSeek-V3	68.46 average normalized public table score	—	Imported	2026-05-27
29	meta-llama/Llama-3.3-70B-Instruct	68.4 average normalized public table score	Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct	Imported	2026-05-27
30	ProbeMedicalYonseiMAILab/medllama3-v20	67.81 average normalized public table score	—	Imported	2026-05-27
31	Qwen/Qwen2.5-72B-Instruct	66.82 average normalized public table score	Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct	Imported	2026-05-27
32	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	66.4 average normalized public table score	R1 Distill Qwen 32B deepseek-deepseek-r1-distill-qwen-32b	Imported	2026-05-27
33	Qwen/Qwen3-235B-A22B	66.02 average normalized public table score	Qwen3 235B A22B qwen-qwen3-235b-a22b	Imported	2026-05-27
34	meta-llama/Llama-3.1-70B	65.56 average normalized public table score	—	Imported	2026-05-27
35	openai/gpt-4.1-mini	65.49 average normalized public table score	GPT-4.1 Mini openai-gpt-4.1-mini	Imported	2026-05-27
36	Intelligent-Internet/II-Medical-8B	65.08 average normalized public table score	—	Imported	2026-05-27
37	Qwen/Qwen2.5-7B-Instruct	64.83 average normalized public table score	Qwen2.5 7B Instruct qwen-qwen-2.5-7b-instruct	Imported	2026-05-27
38	google/medgemma-4b-it	64.2 average normalized public table score	—	Imported	2026-05-27
39	winninghealth/WiNGPT2-Gemma-2-9B-Chat	64.08 average normalized public table score	—	Imported	2026-05-27
40	meta-llama/Llama-4-Scout-17B-16E-Instruct	63.89 average normalized public table score	Llama 4 Scout meta-llama-llama-4-scout	Imported	2026-05-27
41	mistralai/Mistral-Large-Instruct-2407	63.67 average normalized public table score	—	Imported	2026-05-27
42	CohereForAI/aya-expanse-32b	63.37 average normalized public table score	—	Imported	2026-05-27
43	meta-llama/Llama-3.1-8B-Instruct	63.15 average normalized public table score	Llama 3.1 8B Instruct meta-llama-llama-3.1-8b-instruct	Imported	2026-05-27
44	meta-llama/Llama-3.1-70B-Instruct	63.03 average normalized public table score	Llama 3.1 70B Instruct meta-llama-llama-3.1-70b-instruct	Imported	2026-05-27
45	BiMediX/BiMediX-Bi	63 average normalized public table score	—	Imported	2026-05-27
46	m42-health/Llama3-Med42-8B	62.84 average normalized public table score	—	Imported	2026-05-27
47	FractalAIResearch/Ramanujan-Ganit-R1-14B	62.48 average normalized public table score	—	Imported	2026-05-27
48	HuggingFaceTB/SmolLM2-360M-Instruct	62.47 average normalized public table score	—	Imported	2026-05-27
49	meta-llama/Llama-4-Maverick-17B-128E-Instruct	62.27 average normalized public table score	Llama 4 Maverick meta-llama-4-maverick	Imported	2026-05-27
50	Qwen/Qwen2-0.5B	62.2 average normalized public table score	—	Imported	2026-05-27
51	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	61.53 average normalized public table score	—	Imported	2026-05-27
52	mistralai/Mistral-7B-Instruct-v0.3	61.51 average normalized public table score	—	Imported	2026-05-27
53	openai/gpt-oss-120b	61.39 average normalized public table score	gpt-oss-120b openai-gpt-oss-120b	Imported	2026-05-27
54	m42-health/Llama3-Med42-70B	61.34 average normalized public table score	—	Imported	2026-05-27
55	Qwen/Qwen2.5-3B-Instruct	61.19 average normalized public table score	—	Imported	2026-05-27
56	NousResearch/Hermes-3-Llama-3.1-8B	61.15 average normalized public table score	—	Imported	2026-05-27
57	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	61.14 average normalized public table score	—	Imported	2026-05-27
58	OpenMeditron/Meditron3-70B	61.1 average normalized public table score	—	Imported	2026-05-27
59	microsoft/MediPhi-Instruct	60.92 average normalized public table score	—	Imported	2026-05-27
60	microsoft/Phi-3.5-mini-instruct	60.86 average normalized public table score	—	Imported	2026-05-27
61	meta-llama/Meta-Llama-3-70B-Instruct	60.75 average normalized public table score	Llama 3 70B Instruct meta-llama-llama-3-70b-instruct	Imported	2026-05-27
62	Qwen/Qwen3-4B	60.18 average normalized public table score	—	Imported	2026-05-27
63	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	60.14 average normalized public table score	—	Imported	2026-05-27
64	microsoft/phi-4	60.11 average normalized public table score	Phi 4 microsoft-phi-4	Imported	2026-05-27
65	upstage/SOLAR-10.7B-Instruct-v1.0	59.38 average normalized public table score	—	Imported	2026-05-27
66	google/gemma-3-27b-it	59.16 average normalized public table score	Gemma 3 27B google-gemma-3-27b-it	Imported	2026-05-27
67	NousResearch/Hermes-2-Pro-Llama-3-8B	58.83 average normalized public table score	Hermes 2 Pro - Llama-3 8B nousresearch-hermes-2-pro-llama-3-8b	Imported	2026-05-27
68	BioMistral/BioMistral-7B	58.59 average normalized public table score	—	Imported	2026-05-27
69	neulab/Pangea-7B	56.69 average normalized public table score	—	Imported	2026-05-27
70	aaditya/Llama3-OpenBioLLM-70B	55.77 average normalized public table score	—	Imported	2026-05-27
71	Qwen/Qwen3-32B	55.71 average normalized public table score	Qwen3 32B qwen-qwen3-32b	Imported	2026-05-27
72	Qwen/Qwen3-1.7B	55.59 average normalized public table score	—	Imported	2026-05-27
73	openai/gpt-oss-20b	55.49 average normalized public table score	gpt-oss-20b openai-gpt-oss-20b	Imported	2026-05-27
74	Qwen/Qwen3-8B	55.41 average normalized public table score	Qwen3 8B qwen-qwen3-8b	Imported	2026-05-27
75	tiiuae/falcon-mamba-7b-instruct	55.01 average normalized public table score	—	Imported	2026-05-27
76	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	54.49 average normalized public table score	—	Imported	2026-05-27
77	winninghealth/WiNGPT2-Llama-3-8B-Chat	53.72 average normalized public table score	—	Imported	2026-05-27
78	tiiuae/falcon-mamba-7b	53.43 average normalized public table score	—	Imported	2026-05-27
79	meta-llama/Llama-3.2-1B-Instruct	52.78 average normalized public table score	Llama 3.2 1B Instruct meta-llama-llama-3.2-1b-instruct	Imported	2026-05-27
80	tiiuae/Falcon3-Mamba-7B-Instruct	52.61 average normalized public table score	—	Imported	2026-05-27
81	Qwen/Qwen3-14B	52.09 average normalized public table score	Qwen3 14B qwen-qwen3-14b	Imported	2026-05-27
82	meta-llama/Llama-3.2-3B-Instruct	51.44 average normalized public table score	Llama 3.2 3B Instruct meta-llama-llama-3.2-3b-instruct	Imported	2026-05-27
83	01-ai/Yi-1.5-6B-Chat	50.45 average normalized public table score	—	Imported	2026-05-27
84	meta-llama/Llama-3.1-8B	50.06 average normalized public table score	—	Imported	2026-05-27
85	ministral/Ministral-3b-instruct	48.94 average normalized public table score	—	Imported	2026-05-27
86	Qwen/Qwen3-0.6B	48.85 average normalized public table score	—	Imported	2026-05-27
87	Qwen/Qwen3-30B-A3B-Thinking-2507	47.5 average normalized public table score	Qwen3 30B A3B Thinking 2507 qwen-qwen3-30b-a3b-thinking-2507	Imported	2026-05-27
88	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	46.73 average normalized public table score	—	Imported	2026-05-27
89	Qwen/Qwen2.5-0.5B-Instruct	46.08 average normalized public table score	—	Imported	2026-05-27
90	google/medgemma-27b-text-it	45 average normalized public table score	—	Imported	2026-05-27
91	Qwen/Qwen3-4B-Thinking-2507	43 average normalized public table score	—	Imported	2026-05-27
92	deepseek-ai/DeepSeek-R1	35.5 average normalized public table score	R1 deepseek-r1	Imported	2026-05-27
93	microsoft/MediPhi	32.03 average normalized public table score	—	Imported	2026-05-27
94	HuggingFaceTB/SmolLM3-3B	21.73 average normalized public table score	—	Imported	2026-05-27
95	openai/gpt-4o-mini-2024-07-18	19 average normalized public table score	GPT-4o-mini openai-gpt-4o-mini	Imported	2026-05-27
96	Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1	8.5 average normalized public table score	—	Imported	2026-05-27
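
The table above is tab-separated, and each score cell carries the unit label "average normalized public table score" after the number. For readers who want to work with the rows programmatically, here is a minimal Python sketch of one way to load them. The names `SAMPLE` and `parse_leaderboard` are hypothetical and not part of the source page, and the sample contains only two rows copied from the table.

```python
import csv
import io

# Header and two rows copied from the table above (tab-separated).
# SAMPLE is a hypothetical name used only for this illustration.
SAMPLE = (
    "Rank\tSubject\tMEDIC Public Table Average\tModel Match\tProvenance\tSampled\n"
    "2\topenai/gpt-4.1\t91.71 average normalized public table score\t"
    "GPT-4.1 openai-gpt-4.1\tImported\t2026-05-27\n"
    "3\topenai/o4-mini\t90.5 average normalized public table score\t"
    "o4 Mini openai-o4-mini\tImported\t2026-05-27\n"
)


def parse_leaderboard(text):
    """Parse tab-separated leaderboard text into a list of dicts.

    Assumes the column names shown in the table header above and that
    the numeric score is the first whitespace-separated token of the
    score cell.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Drop the trailing unit label and keep only the number.
        score = float(row["MEDIC Public Table Average"].split()[0])
        rows.append(
            {
                "rank": int(row["Rank"]),
                "subject": row["Subject"],
                "score": score,
                "sampled": row["Sampled"],
            }
        )
    return rows


rows = parse_leaderboard(SAMPLE)
for r in rows:
    print(r["rank"], r["subject"], r["score"])
```

Splitting on whitespace and taking the first token keeps the sketch independent of the exact wording of the unit label.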
