OpenAI-MRCR: 2 needle 128k

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

9rows
scoreprimary metric
2026-05-06sampled

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank Subject Score Model Match Provenance Sampled
1 GPT-5 0.95 GPT-5
openai-gpt-5
Self-reported 2026-05-06
2 MiniMax M1 40K 0.76 Self-reported 2026-05-06
3 MiniMax M1 80K 0.73 Self-reported 2026-05-06
4 GPT-4.1 0.57 GPT-4.1
openai-gpt-4.1
Self-reported 2026-05-06
5 GPT-4.1 mini 0.47 GPT-4.1 Mini
openai-gpt-4.1-mini
Self-reported 2026-05-06
6 GPT-4.5 0.39 GPT-4.5
openai-gpt-4.5-preview
Self-reported 2026-05-06
7 GPT-4.1 nano 0.37 GPT-4.1 Nano
openai-gpt-4.1-nano
Self-reported 2026-05-06
8 GPT-4o 0.32 GPT-4o (2024-08-06)
openai-gpt-4o-2024-08-06
Self-reported 2026-05-06
9 o3-mini 0.19 o3-mini
openai-o3-mini
Self-reported 2026-05-06