HealthBench Hard
Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
47 rows
Primary metric: overall_score
Sampled: 2026-05-27
Metadata
Metrics: Overall score, Responding under uncertainty, Health data tasks, Global health, Expertise-tailored communication, Context seeking, Emergency referrals, Response depth
| Rank | Subject | Overall score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | openai/gpt-oss-120b | 0.6 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-27 |
| 2 | google/gemma-3-27b-it | 0.59 | Gemma 3 27B (google-gemma-3-27b-it) | Imported | 2026-05-27 |
| 3 | Qwen/Qwen3-30B-A3B-Thinking-2507 | 0.58 | Qwen3 30B A3B Thinking 2507 (qwen-qwen3-30b-a3b-thinking-2507) | Imported | 2026-05-27 |
| 4 | google/medgemma-27b-text-it | 0.57 | — | Imported | 2026-05-27 |
| 5 | Qwen/Qwen3-8B | 0.56 | Qwen3 8B (qwen-qwen3-8b) | Imported | 2026-05-27 |
| 6 | Intelligent-Internet/II-Medical-8B | 0.54 | — | Imported | 2026-05-27 |
| 7 | Qwen/Qwen3-4B-Thinking-2507 | 0.54 | — | Imported | 2026-05-27 |
| 8 | Qwen/Qwen3-235B-A22B | 0.5 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-27 |
| 9 | Qwen/Qwen3-32B | 0.5 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-27 |
| 10 | deepseek-ai/DeepSeek-R1 | 0.49 | R1 (deepseek-r1) | Imported | 2026-05-27 |
| 11 | Qwen/Qwen2.5-72B-Instruct | 0.49 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-27 |
| 12 | openai/gpt-oss-20b | 0.48 | gpt-oss-20b (openai-gpt-oss-20b) | Imported | 2026-05-27 |
| 13 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 0.47 | R1 Distill Llama 70B (deepseek-deepseek-r1-distill-llama-70b) | Imported | 2026-05-27 |
| 14 | google/medgemma-4b-it | 0.45 | — | Imported | 2026-05-27 |
| 15 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 0.44 | R1 Distill Qwen 32B (deepseek-deepseek-r1-distill-qwen-32b) | Imported | 2026-05-27 |
| 16 | Qwen/Qwen3-4B | 0.43 | — | Imported | 2026-05-27 |
| 17 | deepseek-ai/DeepSeek-V3 | 0.42 | — | Imported | 2026-05-27 |
| 18 | Qwen/Qwen2.5-7B-Instruct | 0.42 | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Imported | 2026-05-27 |
| 19 | meta-llama/Meta-Llama-3-70B-Instruct | 0.41 | Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) | Imported | 2026-05-27 |
| 20 | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 0.41 | — | Imported | 2026-05-27 |
| 21 | meta-llama/Llama-3.3-70B-Instruct | 0.4 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-27 |
| 22 | openai/gpt-4.1-mini | 0.4 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-27 |
| 23 | HuggingFaceTB/SmolLM3-3B | 0.39 | — | Imported | 2026-05-27 |
| 24 | Qwen/Qwen2.5-3B-Instruct | 0.38 | — | Imported | 2026-05-27 |
| 25 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 0.35 | — | Imported | 2026-05-27 |
| 26 | meta-llama/Llama-3.1-8B-Instruct | 0.35 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-27 |
| 27 | mistralai/Mistral-Large-Instruct-2407 | 0.35 | — | Imported | 2026-05-27 |
| 28 | CohereForAI/aya-expanse-32b | 0.34 | — | Imported | 2026-05-27 |
| 29 | microsoft/phi-4 | 0.34 | Phi 4 (microsoft-phi-4) | Imported | 2026-05-27 |
| 30 | m42-health/Llama3-Med42-70B | 0.33 | — | Imported | 2026-05-27 |
| 31 | openai/gpt-4o-mini-2024-07-18 | 0.33 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27 |
| 32 | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 0.32 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-27 |
| 33 | meta-llama/Llama-4-Scout-17B-16E-Instruct | 0.32 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-27 |
| 34 | NousResearch/Hermes-3-Llama-3.1-8B | 0.32 | — | Imported | 2026-05-27 |
| 35 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 0.31 | — | Imported | 2026-05-27 |
| 36 | meta-llama/Llama-3.1-70B-Instruct | 0.29 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Imported | 2026-05-27 |
| 37 | Qwen/Qwen3-14B | 0.28 | Qwen3 14B (qwen-qwen3-14b) | Imported | 2026-05-27 |
| 38 | aaditya/Llama3-OpenBioLLM-70B | 0.26 | — | Imported | 2026-05-27 |
| 39 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 0.26 | — | Imported | 2026-05-27 |
| 40 | meta-llama/Llama-3.2-3B-Instruct | 0.26 | Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct) | Imported | 2026-05-27 |
| 41 | meta-llama/Llama-3.2-1B-Instruct | 0.25 | Llama 3.2 1B Instruct (meta-llama-llama-3.2-1b-instruct) | Imported | 2026-05-27 |
| 42 | OpenMeditron/Meditron3-70B | 0.21 | — | Imported | 2026-05-27 |
| 43 | Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1 | 0.17 | — | Imported | 2026-05-27 |
| 44 | Qwen/Qwen3-0.6B | 0.16 | — | Imported | 2026-05-27 |
| 45 | Qwen/Qwen3-1.7B | 0.16 | — | Imported | 2026-05-27 |
| 46 | Qwen/Qwen2.5-0.5B-Instruct | 0.14 | — | Imported | 2026-05-27 |
| 47 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 0.04 | — | Imported | 2026-05-27 |
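
Several rows in the table show the same overall score but consecutive ranks (for example, ranks 6 and 7 both show 0.54). The page does not say how such ties are ordered; the displayed scores may simply be rounded from more precise values. As a rough sketch only, the Python below shows one way a reader could re-rank a few rows copied from the table so that tied scores share a rank. The helper name `competition_ranks` and the choice of "1-2-2-4" competition ranking are assumptions for illustration, not something the leaderboard defines.

```python
from itertools import groupby

# A few (subject, overall_score) pairs copied from the table above.
rows = [
    ("openai/gpt-oss-120b", 0.60),
    ("google/gemma-3-27b-it", 0.59),
    ("Intelligent-Internet/II-Medical-8B", 0.54),
    ("Qwen/Qwen3-4B-Thinking-2507", 0.54),
    ("Qwen/Qwen3-235B-A22B", 0.50),
]


def competition_ranks(rows):
    """Hypothetical helper: tied scores share the rank of the first tied row."""
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked = []
    position = 1
    for score, group in groupby(ordered, key=lambda r: r[1]):
        members = list(group)
        for subject, _ in members:
            ranked.append((position, subject, score))
        # The next distinct score skips past all tied members.
        position += len(members)
    return ranked


for rank, subject, score in competition_ranks(rows):
    print(f"{rank}\t{score:.2f}\t{subject}")
```

With this scheme the two 0.54 rows would both receive rank 3 in the five-row sample, and the next row would receive rank 5. Whether that matches the leaderboard's intent depends on the unrounded scores, which are not shown here.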