MRCR v2 (8-needle)

MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark in which 8 needle items must be retrieved from a long context. It tests a model's ability to track and reason about multiple pieces of information at once across extended conversations.

Rows: 9
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                       Score  Model Match                                                           Provenance     Sampled
1     Claude Opus 4.6               0.93   Claude Opus 4.6 (anthropic-claude-opus-4.6)                           Self-reported  2026-05-06
2     GPT-5.5                       0.74   GPT-5.5 (openai-gpt-5.5)                                              Self-reported  2026-05-06
3     Gemini 3.1 Flash-Lite         0.60   Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview)  Self-reported  2026-05-06
4     GPT-5.4 mini                  0.34   GPT-5.4 Mini (openai-gpt-5.4-mini)                                    Self-reported  2026-05-06
5     GPT-5.4 nano                  0.33   GPT-5.4 Nano (openai-gpt-5.4-nano)                                    Self-reported  2026-05-06
6     Gemini 3 Pro                  0.26   Gemini 3 (google-gemini-3)                                            Self-reported  2026-05-06
6     Gemini 3.1 Pro                0.26   Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)                Self-reported  2026-05-06
8     Gemini 3 Flash                0.22   Gemini 3 Flash Preview (google-gemini-3-flash-preview)                Self-reported  2026-05-06
9     Gemini 2.5 Pro Preview 06-05  0.16   Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview)          Self-reported  2026-05-06
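
The Rank column above gives both 0.26 entries rank 6 and then jumps to rank 8, which is consistent with standard competition ranking (tied scores share a rank and the next rank is skipped). The sketch below reproduces that numbering from the listed scores. It is an illustration only: the function name competition_ranks and the entries list are hypothetical, and the leaderboard's actual ranking code is not shown in the source.

```python
# Minimal sketch of standard competition ranking ("1, 2, 2, 4" style).
# Assumption: the leaderboard ranks by descending score and lets ties share a rank.

def competition_ranks(entries):
    """Return (rank, subject, score) tuples, highest score first.

    Tied scores receive the same rank; the rank after a tie skips ahead
    by the number of tied entries.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    ranked = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as previous entry: share its rank
        else:
            rank = position       # otherwise rank equals 1-based position
        ranked.append((rank, subject, score))
    return ranked


# Scores copied from the table above.
entries = [
    ("Claude Opus 4.6", 0.93),
    ("GPT-5.5", 0.74),
    ("Gemini 3.1 Flash-Lite", 0.60),
    ("GPT-5.4 mini", 0.34),
    ("GPT-5.4 nano", 0.33),
    ("Gemini 3 Pro", 0.26),
    ("Gemini 3.1 Pro", 0.26),
    ("Gemini 3 Flash", 0.22),
    ("Gemini 2.5 Pro Preview 06-05", 0.16),
]

ranked = competition_ranks(entries)
for rank, subject, score in ranked:
    print(f"{rank:>2}  {subject:<30} {score:.2f}")
```

Under this scheme no entry receives rank 7, because two entries occupy positions 6 and 7 with the same score.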