OpenAI-MRCR: 2 needle 1M
Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversations. Models must reproduce specific instances of content (e.g., 'Return the 2nd poem about tapirs') from multi-turn synthetic conversations, requiring reasoning about context, ordering, and subtle differences between similar outputs.
5rows
scoreprimary metric
2026-05-06sampled
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | MiniMax M1 40K | 0.59 | — | Self-reported | 2026-05-06 |
| 2 | MiniMax M1 80K | 0.56 | — | Self-reported | 2026-05-06 |
| 3 | GPT-4.1 | 0.46 | GPT-4.1 openai-gpt-4.1 | Self-reported | 2026-05-06 |
| 4 | GPT-4.1 mini | 0.33 | GPT-4.1 Mini openai-gpt-4.1-mini | Self-reported | 2026-05-06 |
| 5 | GPT-4.1 nano | 0.12 | GPT-4.1 Nano openai-gpt-4.1-nano | Self-reported | 2026-05-06 |
No matching rows.