Med-HALT

Med-HALT (Medical Domain Hallucination Test) evaluates hallucination in large language models on medical-domain questions, split into reasoning hallucination tests and memory hallucination tests.

Rows: 10
Primary metric: mean_hallucination_test_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Reasoning hallucination average accuracy, Reasoning hallucination average pointwise score, Memory hallucination average accuracy, Memory hallucination average pointwise score

Latest Results

Rows are transcribed from Tables 2 and 3 of the public Med-HALT paper (CoNLL 2023). The primary score is not reported in the paper itself; BenchmarkList derives it as the mean of the reasoning hallucination average accuracy and the memory hallucination average accuracy.
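The derivation of the primary score can be sketched as follows. This is a minimal illustration, assuming the mean is a plain unweighted average of the two average accuracies; the page does not state the weighting. The function names and the input values are hypothetical placeholders, not figures from the Med-HALT tables.

```python
from typing import Dict, List, Tuple


def mean_hallucination_test_accuracy(reasoning_acc: float, memory_acc: float) -> float:
    """Unweighted mean of the two average accuracies, in percent.

    Assumption: equal weighting of the reasoning and memory tests.
    """
    return (reasoning_acc + memory_acc) / 2.0


def rank_subjects(
    per_subject: Dict[str, Tuple[float, float]]
) -> List[Tuple[int, str, float]]:
    """Return (rank, subject, score) rows sorted by descending primary score."""
    scored = [
        (name, mean_hallucination_test_accuracy(reasoning, memory))
        for name, (reasoning, memory) in per_subject.items()
    ]
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(scored, start=1)]


if __name__ == "__main__":
    # Placeholder inputs: (reasoning average accuracy, memory average accuracy).
    # These values are invented for illustration only.
    placeholder = {
        "model-a": (40.0, 30.0),
        "model-b": (10.0, 20.0),
        "model-c": (50.0, 45.0),
    }
    for rank, name, score in rank_subjects(placeholder):
        print(f"{rank} {name} {score:.3f}%")
```

Averaging two accuracies reported to two decimal places can yield a third decimal place, which is consistent with scores such as 44.725% in the table.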

Rank  Subject              Score    Model Match  Provenance  Sampled
1     Falcon 40B           44.725%               Imported    2026-05-27
2     Falcon 40B-instruct  41.205%               Imported    2026-05-27
3     Llama-2 70B          40.185%               Imported    2026-05-27
4     Text-Davinci         37.105%               Imported    2026-05-27
5     Llama-2 13B          32.53%                Imported    2026-05-27
6     GPT-3.5              32.22%                Imported    2026-05-27
7     Llama-2-7B           21.945%               Imported    2026-05-27
8     Llama-2-13B-chat     20.51%                Imported    2026-05-27
9     Llama-2-7B-chat      14.38%                Imported    2026-05-27
10    Llama-2 70B Chat     12.155%               Imported    2026-05-27