HealthBench Hard

Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.

Rows: 47
Primary metric: overall_score
Sampled: 2026-05-27

Metadata

Metrics

Overall score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth

Latest Results

Rows are derived from the public MEDIC Benchmark component table for HealthBench Hard. Overall score is used as the primary score, with HealthBench Hard subcategory metrics preserved.
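As a rough illustration of how a ranking keyed on the primary score could be produced, the sketch below sorts rows by overall score in descending order. The `Row` class, the `rank_rows` helper, and the tie-break on case-insensitive subject name are assumptions made for this example; the tie-break is consistent with the order shown on this page but is not a documented rule of the source table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical container for one leaderboard row."""
    subject: str
    overall_score: float


def rank_rows(rows):
    """Return (rank, row) pairs ordered by overall_score, highest first.

    Assumption: ties are broken by case-insensitive subject name and
    tied rows still receive consecutive ranks.
    """
    ordered = sorted(rows, key=lambda r: (-r.overall_score, r.subject.lower()))
    return list(enumerate(ordered, start=1))


# A few rows taken from the table on this page, deliberately out of order.
rows = [
    Row("Qwen/Qwen3-4B-Thinking-2507", 0.54),
    Row("openai/gpt-oss-120b", 0.6),
    Row("Intelligent-Internet/II-Medical-8B", 0.54),
    Row("google/gemma-3-27b-it", 0.59),
]

ranked = rank_rows(rows)
for rank, row in ranked:
    print(rank, row.subject, row.overall_score)
```

Subcategory metrics would travel with each row as extra fields; they are omitted here because only the overall score affects the ordering.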

Rank | Subject | Overall score | Model Match | Provenance | Sampled
1 | openai/gpt-oss-120b | 0.6 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-27
2 | google/gemma-3-27b-it | 0.59 | Gemma 3 27B (google-gemma-3-27b-it) | Imported | 2026-05-27
3 | Qwen/Qwen3-30B-A3B-Thinking-2507 | 0.58 | Qwen3 30B A3B Thinking 2507 (qwen-qwen3-30b-a3b-thinking-2507) | Imported | 2026-05-27
4 | google/medgemma-27b-text-it | 0.57 | - | Imported | 2026-05-27
5 | Qwen/Qwen3-8B | 0.56 | Qwen3 8B (qwen-qwen3-8b) | Imported | 2026-05-27
6 | Intelligent-Internet/II-Medical-8B | 0.54 | - | Imported | 2026-05-27
7 | Qwen/Qwen3-4B-Thinking-2507 | 0.54 | - | Imported | 2026-05-27
8 | Qwen/Qwen3-235B-A22B | 0.5 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-27
9 | Qwen/Qwen3-32B | 0.5 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-27
10 | deepseek-ai/DeepSeek-R1 | 0.49 | R1 (deepseek-r1) | Imported | 2026-05-27
11 | Qwen/Qwen2.5-72B-Instruct | 0.49 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-27
12 | openai/gpt-oss-20b | 0.48 | gpt-oss-20b (openai-gpt-oss-20b) | Imported | 2026-05-27
13 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 0.47 | R1 Distill Llama 70B (deepseek-deepseek-r1-distill-llama-70b) | Imported | 2026-05-27
14 | google/medgemma-4b-it | 0.45 | - | Imported | 2026-05-27
15 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 0.44 | R1 Distill Qwen 32B (deepseek-deepseek-r1-distill-qwen-32b) | Imported | 2026-05-27
16 | Qwen/Qwen3-4B | 0.43 | - | Imported | 2026-05-27
17 | deepseek-ai/DeepSeek-V3 | 0.42 | - | Imported | 2026-05-27
18 | Qwen/Qwen2.5-7B-Instruct | 0.42 | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Imported | 2026-05-27
19 | meta-llama/Meta-Llama-3-70B-Instruct | 0.41 | Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) | Imported | 2026-05-27
20 | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 0.41 | - | Imported | 2026-05-27
21 | meta-llama/Llama-3.3-70B-Instruct | 0.4 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-27
22 | openai/gpt-4.1-mini | 0.4 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-27
23 | HuggingFaceTB/SmolLM3-3B | 0.39 | - | Imported | 2026-05-27
24 | Qwen/Qwen2.5-3B-Instruct | 0.38 | - | Imported | 2026-05-27
25 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 0.35 | - | Imported | 2026-05-27
26 | meta-llama/Llama-3.1-8B-Instruct | 0.35 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-27
27 | mistralai/Mistral-Large-Instruct-2407 | 0.35 | - | Imported | 2026-05-27
28 | CohereForAI/aya-expanse-32b | 0.34 | - | Imported | 2026-05-27
29 | microsoft/phi-4 | 0.34 | Phi 4 (microsoft-phi-4) | Imported | 2026-05-27
30 | m42-health/Llama3-Med42-70B | 0.33 | - | Imported | 2026-05-27
31 | openai/gpt-4o-mini-2024-07-18 | 0.33 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27
32 | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 0.32 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-27
33 | meta-llama/Llama-4-Scout-17B-16E-Instruct | 0.32 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-27
34 | NousResearch/Hermes-3-Llama-3.1-8B | 0.32 | - | Imported | 2026-05-27
35 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 0.31 | - | Imported | 2026-05-27
36 | meta-llama/Llama-3.1-70B-Instruct | 0.29 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Imported | 2026-05-27
37 | Qwen/Qwen3-14B | 0.28 | Qwen3 14B (qwen-qwen3-14b) | Imported | 2026-05-27
38 | aaditya/Llama3-OpenBioLLM-70B | 0.26 | - | Imported | 2026-05-27
39 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 0.26 | - | Imported | 2026-05-27
40 | meta-llama/Llama-3.2-3B-Instruct | 0.26 | Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct) | Imported | 2026-05-27
41 | meta-llama/Llama-3.2-1B-Instruct | 0.25 | Llama 3.2 1B Instruct (meta-llama-llama-3.2-1b-instruct) | Imported | 2026-05-27
42 | OpenMeditron/Meditron3-70B | 0.21 | - | Imported | 2026-05-27
43 | Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1 | 0.17 | - | Imported | 2026-05-27
44 | Qwen/Qwen3-0.6B | 0.16 | - | Imported | 2026-05-27
45 | Qwen/Qwen3-1.7B | 0.16 | - | Imported | 2026-05-27
46 | Qwen/Qwen2.5-0.5B-Instruct | 0.14 | - | Imported | 2026-05-27
47 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 0.04 | - | Imported | 2026-05-27