Omi SOAP Note Safety Benchmark

A safety-first benchmark for clinical SOAP note generation that measures groundedness, hallucinations, coverage, and note quality across 300 doctor-patient dialogues.

Rows: 6
Primary metric: Composite
Sampled: 2026-04-21

Metadata

Metrics

Composite, Safety, Evidence, Coverage, Generalist, Major Hallucinations per Note (lower is better), Major Risk vs Omi (lower is better), Minor Hallucinations per Note (lower is better), Minor Risk vs Omi (lower is better), Majority Major Rate (lower is better)

Latest Results

Rows ranked by highest Composite score.

Rank  Subject               Composite  Model Match                                                 Provenance  Sampled
1     gpt-5.2               4.72       GPT-5.2 (openai-gpt-5.2)                                    Imported    2026-04-21
2     gemini-3-pro-preview  4.70       Gemini 3 (google-gemini-3)                                  Imported    2026-04-21
3     Omi-SOAP-edge-v1      4.65       -                                                           Imported    2026-04-21
4     Kimi-K2-Thinking      4.55       MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking)  Imported    2026-04-21
5     claude-opus-4-5       4.54       Claude Opus 4.5 (anthropic-claude-opus-4.5)                 Imported    2026-04-21
6     GPT-5                 4.29       GPT-5 (openai-gpt-5)                                        Imported    2026-04-21
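
The ranking rule stated above (rows ordered by highest Composite score) can be reproduced with a short sketch. The subject names and scores are copied from the results listed here; the tuple layout and the function name rank_by_composite are illustrative assumptions, not part of the benchmark's own tooling. How the benchmark breaks ties is not stated, so this sketch simply keeps input order for equal scores.

```python
# Composite scores copied from the results above.
# The (subject, composite) tuple layout is an assumption made for this sketch.
rows = [
    ("GPT-5", 4.29),
    ("claude-opus-4-5", 4.54),
    ("gpt-5.2", 4.72),
    ("Omi-SOAP-edge-v1", 4.65),
    ("Kimi-K2-Thinking", 4.55),
    ("gemini-3-pro-preview", 4.70),
]


def rank_by_composite(rows):
    """Return (rank, subject, composite) tuples, highest composite first.

    Python's sort is stable, so subjects with equal scores keep their
    input order. The benchmark's actual tie-breaking rule is unknown.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranking = rank_by_composite(rows)
for rank, subject, score in ranking:
    print(f"{rank} {subject} {score:.2f}")
```

Only the Composite column is used here; the other listed metrics (Safety, Evidence, Coverage, and the hallucination rates) would need their own columns and, for the "lower is better" ones, an ascending sort.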