Omi SOAP Note Safety Benchmark
Safety-first clinical SOAP note generation benchmark measuring groundedness, hallucinations, coverage, and note quality across 300 doctor-patient dialogues.
Rows: 6
Primary metric: composite
Sampled: 2026-04-21
Metadata
Metrics
Composite, Safety, Evidence, Coverage, Generalist, Major Hallucinations per Note (lower is better), Major Risk vs Omi (lower is better), Minor Hallucinations per Note (lower is better), Minor Risk vs Omi (lower is better), Majority Major Rate (lower is better)
| Rank | Subject | Composite | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-5.2 | 4.72 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-04-21 |
| 2 | gemini-3-pro-preview | 4.70 | Gemini 3 (`google-gemini-3`) | Imported | 2026-04-21 |
| 3 | Omi-SOAP-edge-v1 | 4.65 | — | Imported | 2026-04-21 |
| 4 | Kimi-K2-Thinking | 4.55 | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Imported | 2026-04-21 |
| 5 | claude-opus-4-5 | 4.54 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-04-21 |
| 6 | GPT-5 | 4.29 | GPT-5 (`openai-gpt-5`) | Imported | 2026-04-21 |
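The Rank column appears to follow descending composite score. The page does not state this explicitly, so the sketch below treats it as an assumption. It rebuilds the ordering from the composite values in the table. The `rank` helper and its `lower_is_better` flag are hypothetical names introduced here for illustration. The flag covers metrics such as Major Hallucinations per Note, where a smaller value ranks first.

```python
# Composite scores copied from the table above (higher is better).
composite = {
    "gpt-5.2": 4.72,
    "gemini-3-pro-preview": 4.70,
    "Omi-SOAP-edge-v1": 4.65,
    "Kimi-K2-Thinking": 4.55,
    "claude-opus-4-5": 4.54,
    "GPT-5": 4.29,
}


def rank(scores, lower_is_better=False):
    """Return (position, subject, value) tuples, best subject first.

    Hypothetical helper: assumes ranking is a plain sort on one metric.
    """
    ordered = sorted(
        scores.items(),
        key=lambda item: item[1],
        reverse=not lower_is_better,  # descending unless lower is better
    )
    return [(i + 1, name, value) for i, (name, value) in enumerate(ordered)]


ranking = rank(composite)
for position, name, value in ranking:
    print(f"{position}. {name}: {value:.2f}")
```

For a metric marked "lower is better", calling `rank(values, lower_is_better=True)` would put the smallest value in first place.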