Medical Chronology LLM Benchmark

A benchmark that evaluates LLMs on extracting structured timelines from synthetic medical-legal records, across six golden datasets and three generation rounds.

Rows: 11
Primary metric: composite_score
Sampled: 2026-05-06

Metadata

Metrics

Composite Score, Average F1, Average Precision, Average Recall, Average ROUGE-L, Average Token Overlap, Formatting Score, Chronological Score, Hallucination-Free Rate, Average Latency (lower is better), Mean Total Tokens
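For readers unfamiliar with the precision, recall, and F1 metrics listed above, the following is a minimal sketch of how they relate for a single extracted timeline. It assumes events are compared by exact match on a (date, description) pair. That matching rule, the function name, and the sample events are all illustrative assumptions; this page does not describe how the benchmark matches events.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 (exact-match assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative events only; not taken from the benchmark's datasets.
gold = {("2024-01-03", "ER visit"), ("2024-01-10", "MRI"), ("2024-02-01", "surgery")}
predicted = {("2024-01-03", "ER visit"), ("2024-01-10", "MRI"), ("2024-03-05", "follow-up")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of three predicted events match the gold timeline, so precision and recall coincide. A benchmark that tolerates paraphrased descriptions would need a softer matching rule than set intersection.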

Latest Results

Rows are parsed from the public final leaderboard JSON. Source model display names are preserved, and the Composite Score is the weighted score reported by the source.
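The parsing step can be sketched roughly as follows. The JSON layout and the field names (`subject`, `display_name`, `composite_score`) are hypothetical, since the real leaderboard schema is not shown on this page. The three sample rows reuse scores from the table.

```python
import json

# Hypothetical payload: the actual leaderboard JSON schema is an assumption.
SAMPLE = """
[
  {"subject": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "composite_score": 0.90},
  {"subject": "claude-opus-4.6", "display_name": "Claude Opus 4.6", "composite_score": 0.92},
  {"subject": "minimax-m2.5", "display_name": "MiniMax M2.5", "composite_score": 0.89}
]
"""


def rank_rows(raw: str) -> list[dict]:
    """Parse leaderboard rows and rank them by composite score, highest first."""
    rows = json.loads(raw)
    # sorted() is stable, so rows with equal scores keep their source order.
    ordered = sorted(rows, key=lambda row: row["composite_score"], reverse=True)
    return [{"rank": i, **row} for i, row in enumerate(ordered, start=1)]


ranked = rank_rows(SAMPLE)
for row in ranked:
    print(f'{row["rank"]} {row["subject"]} {row["composite_score"]:.2f} {row["display_name"]}')
```

Because several models share the same rounded composite score, a stable sort matters here: ties keep whatever order the source file gives them rather than being reshuffled.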

Rank  Subject           Composite Score  Model Match                                             Provenance  Sampled
1     claude-opus-4.6   0.92             Claude Opus 4.6 (anthropic-claude-opus-4.6)             Imported    2026-05-06
2     gemini-2.5-flash  0.91             Gemini 2.5 Flash (google-gemini-2.5-flash)              Imported    2026-05-06
3     claude-opus-4.5   0.91             Claude Opus 4.5 (anthropic-claude-opus-4.5)             Imported    2026-05-06
4     gemini-3-flash    0.91             Gemini 3 Flash Preview (google-gemini-3-flash-preview)  Imported    2026-05-06
5     gpt-5.4-mini      0.91             GPT-5.4 Mini (openai-gpt-5.4-mini)                      Imported    2026-05-06
6     gemini-2.5-pro    0.90             Gemini 2.5 Pro (google-gemini-2.5-pro)                  Imported    2026-05-06
7     minimax-m2.5      0.89             MiniMax M2.5 (minimax-minimax-m2.5)                     Imported    2026-05-06
8     gpt-5.4           0.89             GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-06
9     gpt-5.4-pro       0.89             GPT-5.4 Pro (openai-gpt-5.4-pro)                        Imported    2026-05-06
10    gemini-3.1-pro    0.88             Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-06
11    qwen3-235b        0.88             Qwen3 235B A22B (qwen-qwen3-235b-a22b)                  Imported    2026-05-06