Med-HALT
Med-HALT: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
10 rows
Primary metric: mean_hallucination_test_accuracy
Sampled: 2026-05-27
Metadata
Metrics
Reasoning hallucination average accuracy, Reasoning hallucination average pointwise score, Memory hallucination average accuracy, Memory hallucination average pointwise score
| Rank | Subject | Score (mean_hallucination_test_accuracy) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Falcon 40B | 44.725% | — | Imported | 2026-05-27 |
| 2 | Falcon 40B-instruct | 41.205% | — | Imported | 2026-05-27 |
| 3 | Llama-2 70B | 40.185% | — | Imported | 2026-05-27 |
| 4 | Text-Davinci | 37.105% | — | Imported | 2026-05-27 |
| 5 | Llama-2 13B | 32.53% | — | Imported | 2026-05-27 |
| 6 | GPT-3.5 | 32.22% | — | Imported | 2026-05-27 |
| 7 | Llama-2-7B | 21.945% | — | Imported | 2026-05-27 |
| 8 | Llama-2-13B-chat | 20.51% | — | Imported | 2026-05-27 |
| 9 | Llama-2-7B-chat | 14.38% | — | Imported | 2026-05-27 |
| 10 | Llama-2 70B Chat | 12.155% | — | Imported | 2026-05-27 |