MRCR v2 (8-needle)
MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. It tests a model's ability to track and reason about multiple pieces of information at once across extended conversations.
Rows: 9
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 0.93 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Self-reported | 2026-05-06 |
| 2 | GPT-5.5 | 0.74 | GPT-5.5 (`openai-gpt-5.5`) | Self-reported | 2026-05-06 |
| 3 | Gemini 3.1 Flash-Lite | 0.60 | Gemini 3.1 Flash Lite Preview (`google-gemini-3.1-flash-lite-preview`) | Self-reported | 2026-05-06 |
| 4 | GPT-5.4 mini | 0.34 | GPT-5.4 Mini (`openai-gpt-5.4-mini`) | Self-reported | 2026-05-06 |
| 5 | GPT-5.4 nano | 0.33 | GPT-5.4 Nano (`openai-gpt-5.4-nano`) | Self-reported | 2026-05-06 |
| 6 | Gemini 3 Pro | 0.26 | Gemini 3 (`google-gemini-3`) | Self-reported | 2026-05-06 |
| 6 | Gemini 3.1 Pro | 0.26 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Self-reported | 2026-05-06 |
| 8 | Gemini 3 Flash | 0.22 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Self-reported | 2026-05-06 |
| 9 | Gemini 2.5 Pro Preview 06-05 | 0.16 | Gemini 2.5 Pro Preview 06-05 (`google-gemini-2.5-pro-preview`) | Self-reported | 2026-05-06 |
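The Rank column appears to follow standard competition ranking: the two entries tied at 0.26 share rank 6, and the next entry is rank 8. The sketch below recomputes ranks from the scores in the table under that assumption. The function name `competition_rank` and the tuple layout are illustrative and do not come from the leaderboard itself.

```python
# Scores copied from the table above. Ranks are recomputed here,
# assuming standard competition ranking (ties share a rank, the next rank is skipped).
scores = [
    ("Claude Opus 4.6", 0.93),
    ("GPT-5.5", 0.74),
    ("Gemini 3.1 Flash-Lite", 0.60),
    ("GPT-5.4 mini", 0.34),
    ("GPT-5.4 nano", 0.33),
    ("Gemini 3 Pro", 0.26),
    ("Gemini 3.1 Pro", 0.26),
    ("Gemini 3 Flash", 0.22),
    ("Gemini 2.5 Pro Preview 06-05", 0.16),
]


def competition_rank(rows):
    """Return (rank, subject, score) tuples, highest score first.

    `competition_rank` is a hypothetical helper, not part of any leaderboard API.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (subject, score) in enumerate(ordered):
        if position > 0 and score == ordered[position - 1][1]:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position + 1   # otherwise rank equals the 1-based position
        ranked.append((rank, subject, score))
    return ranked


for rank, subject, score in competition_rank(scores):
    print(f"{rank:>2}  {subject:<30} {score:.2f}")
```

Reusing the previous rank on a tie and otherwise falling back to the 1-based position is what produces the 6, 6, 8 sequence seen in the table.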