OpenAI-MRCR: 2 needle 1M

OpenAI-MRCR (Multi-Round Co-reference Resolution) is a benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in a long conversation. The model must reproduce a specific instance of content (e.g., 'Return the 2nd poem about tapirs') from a multi-turn synthetic conversation, which requires reasoning about context, ordering, and subtle differences between similar outputs.
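The task can be illustrated with a toy sketch. The conversation, the helper names (`nth_reply`, `similarity`), and the use of `difflib.SequenceMatcher` as a scoring function are assumptions made for illustration only; this page does not state how the benchmark builds its prompts or computes its score.

```python
from difflib import SequenceMatcher

# Hypothetical toy conversation: (user request, assistant reply) pairs.
# Two replies answer the same request, so they act as similar "needles".
TURNS = [
    ("Write a poem about tapirs", "Tapirs wander through the rain."),
    ("Write a story about otters", "An otter found a shiny stone."),
    ("Write a poem about tapirs", "Tapirs doze beneath the moon."),
]


def nth_reply(turns, request, n):
    """Return the reply to the n-th (1-based) occurrence of `request`."""
    matches = [reply for req, reply in turns if req == request]
    return matches[n - 1]


def similarity(response, expected):
    """Score a response in [0, 1] by string similarity to the expected needle.

    Assumption: a sequence-similarity ratio stands in for the real metric.
    """
    return SequenceMatcher(None, response, expected).ratio()


# "Return the 2nd poem about tapirs"
expected = nth_reply(TURNS, "Write a poem about tapirs", 2)

# A model that returns the right needle scores 1.0; one that confuses
# it with the first, similar needle scores lower.
score_right = similarity("Tapirs doze beneath the moon.", expected)
score_wrong = similarity("Tapirs wander through the rain.", expected)
```

Under these assumptions, returning the first tapir poem instead of the second still earns partial credit because the two needles share text, which is why similar needles make the task harder than locating a single unique string.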

5 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject         Score  Model Match                         Provenance     Sampled
1     MiniMax M1 40K  0.59   -                                   Self-reported  2026-05-06
2     MiniMax M1 80K  0.56   -                                   Self-reported  2026-05-06
3     GPT-4.1         0.46   GPT-4.1 (openai-gpt-4.1)            Self-reported  2026-05-06
4     GPT-4.1 mini    0.33   GPT-4.1 Mini (openai-gpt-4.1-mini)  Self-reported  2026-05-06
5     GPT-4.1 nano    0.12   GPT-4.1 Nano (openai-gpt-4.1-nano)  Self-reported  2026-05-06