PlaceboBench

Medical-domain hallucination benchmark with labeled model answers to pharmaceutical questions grounded in EMA product information.

Rows: 7
Primary metric: non_hallucination_rate
Sampled: 2026-05-27

Metadata

Metrics

Non-Hallucination Rate, Hallucination Rate (lower is better), Hallucinations per Answer (lower is better), Sample Count

Latest Results

Rows are aggregated by source model from the labeled PlaceboBench train split. The primary score, Non-Hallucination Rate, is 100 minus the hallucinated-answer rate, so higher is better.
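As a rough illustration of that aggregation, the sketch below groups labeled answers by source model and derives the four listed metrics. The field names `source_model` and `num_hallucinations`, the function `aggregate_by_model`, and the sample rows are all hypothetical. They are not taken from the PlaceboBench schema or data. The sketch also assumes an answer counts as hallucinated when it carries at least one labeled hallucination.

```python
from collections import defaultdict


def aggregate_by_model(rows):
    """Aggregate labeled answers into per-model metrics.

    Each row is a dict with two hypothetical keys:
      "source_model": str, the model that produced the answer
      "num_hallucinations": int, labeled hallucinations in that answer
    """
    counts = defaultdict(
        lambda: {"answers": 0, "hallucinated": 0, "hallucinations": 0}
    )
    for row in rows:
        c = counts[row["source_model"]]
        c["answers"] += 1
        c["hallucinations"] += row["num_hallucinations"]
        # Assumption: one or more labeled hallucinations marks the answer
        # as hallucinated.
        if row["num_hallucinations"] > 0:
            c["hallucinated"] += 1

    results = []
    for model, c in counts.items():
        hallucination_rate = 100.0 * c["hallucinated"] / c["answers"]
        results.append({
            "subject": model,
            # Primary score: 100 minus the hallucinated-answer rate.
            "non_hallucination_rate": 100.0 - hallucination_rate,
            "hallucination_rate": hallucination_rate,
            "hallucinations_per_answer": c["hallucinations"] / c["answers"],
            "sample_count": c["answers"],
        })

    # Rank by the primary metric, best first.
    results.sort(key=lambda r: r["non_hallucination_rate"], reverse=True)
    return results


# Made-up rows for illustration only; not benchmark data.
rows = [
    {"source_model": "model-a", "num_hallucinations": 0},
    {"source_model": "model-a", "num_hallucinations": 2},
    {"source_model": "model-a", "num_hallucinations": 0},
    {"source_model": "model-a", "num_hallucinations": 0},
    {"source_model": "model-b", "num_hallucinations": 1},
    {"source_model": "model-b", "num_hallucinations": 0},
]
leaderboard = aggregate_by_model(rows)
for rank, entry in enumerate(leaderboard, start=1):
    print(rank, entry["subject"], entry["non_hallucination_rate"])
```

Note that Hallucination Rate and Hallucinations per Answer can diverge: in the made-up rows, `model-a` has fewer hallucinated answers but the same number of hallucinations per answer as `model-b`, because one of its answers contains two.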

Rank | Subject                             | Non-Hallucination Rate | Model Match                                            | Provenance | Sampled
1    | gemini-3-pro-preview                | 73.913                 | Gemini 3 (google-gemini-3)                             | Imported   | 2026-05-27
2    | gpt-5.2                             | 63.2353                | GPT-5.2 (openai-gpt-5.2)                               | Imported   | 2026-05-27
3    | claude-sonnet-4-5                   | 62.3188                | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5)        | Imported   | 2026-05-27
4    | accounts/fireworks/models/kimi-k2p5 | 53.6232                | (none)                                                 | Imported   | 2026-05-27
5    | gemini-3-flash-preview              | 44.9275                | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported   | 2026-05-27
6    | gpt-5-mini                          | 39.1304                | GPT-5 Mini (openai-gpt-5-mini)                         | Imported   | 2026-05-27
7    | claude-opus-4-6                     | 36.2319                | Claude Opus 4.6 (anthropic-claude-opus-4.6)            | Imported   | 2026-05-27