Med-HALT | BenchmarkList

Metadata

ID: med_halt
Category: Healthcare
Release: Unknown

Metrics

Reasoning hallucination average accuracy, Reasoning hallucination average pointwise score, Memory hallucination average accuracy, Memory hallucination average pointwise score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Falcon 40B	44.725%	—	Imported	2026-05-27
2	Falcon 40B-instruct	41.205%	—	Imported	2026-05-27
3	Llama-2 70B	40.185%	—	Imported	2026-05-27
4	Text-Davinci	37.105%	—	Imported	2026-05-27
5	Llama-2 13B	32.53%	—	Imported	2026-05-27
6	GPT-3.5	32.22%	—	Imported	2026-05-27
7	Llama-2 7B	21.945%	—	Imported	2026-05-27
8	Llama-2 13B-chat	20.51%	—	Imported	2026-05-27
9	Llama-2 7B-chat	14.38%	—	Imported	2026-05-27
10	Llama-2 70B-chat	12.155%	—	Imported	2026-05-27