Medical Chronology LLM Benchmark
A benchmark that evaluates LLMs on extracting structured timelines from synthetic medical-legal records, across six golden datasets and three generation rounds.
Rows: 11
Primary metric: composite_score
Sampled: 2026-05-06
Metadata
Metrics
Composite Score, Average F1, Average Precision, Average Recall, Average ROUGE-L, Average Token Overlap, Formatting Score, Chronological Score, Hallucination-Free Rate, Average Latency (lower is better), Mean Total Tokens
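For readers unfamiliar with the precision, recall, and F1 metrics listed above, here is a minimal sketch of how they could be computed for a single extracted timeline. It assumes exact matching on `(date, event)` pairs, which is an illustrative simplification: the benchmark's actual matching procedure and its composite-score formula are not described on this page. The `Event` shape, the function name, and the sample records are all hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical event shape: (ISO date, normalized event description).
Event = Tuple[str, str]


def precision_recall_f1(
    predicted: Iterable[Event], gold: Iterable[Event]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 under exact-match (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example data, not taken from the benchmark.
gold = [
    ("2024-01-03", "er visit"),
    ("2024-01-10", "mri lumbar spine"),
    ("2024-02-01", "physical therapy start"),
]
predicted = [
    ("2024-01-03", "er visit"),
    ("2024-01-10", "mri lumbar spine"),
    ("2024-01-15", "surgery"),  # not in gold: counts against precision
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case two of three predicted events match the gold timeline, so precision and recall are both two thirds. A real evaluation would likely need fuzzy matching on dates and descriptions rather than exact equality.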
| Rank | Subject | Composite Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | claude-opus-4.6 | 0.92 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 2 | gemini-2.5-flash | 0.91 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-06 |
| 3 | claude-opus-4.5 | 0.91 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 4 | gemini-3-flash | 0.91 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-06 |
| 5 | gpt-5.4-mini | 0.91 | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Imported | 2026-05-06 |
| 6 | gemini-2.5-pro | 0.90 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 7 | minimax-m2.5 | 0.89 | MiniMax M2.5 (`minimax-minimax-m2.5`) | Imported | 2026-05-06 |
| 8 | gpt-5.4 | 0.89 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-06 |
| 9 | gpt-5.4-pro | 0.89 | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Imported | 2026-05-06 |
| 10 | gemini-3.1-pro | 0.88 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-06 |
| 11 | qwen3-235b | 0.88 | Qwen3 235B A22B (`qwen-qwen3-235b-a22b`) | Imported | 2026-05-06 |