FRAMES

Factuality, Retrieval, And reasoning MEasurement Set: a unified evaluation dataset of 824 challenging multi-hop questions for testing retrieval-augmented generation (RAG) systems on factuality, retrieval accuracy, and reasoning. Each question requires integrating information from 2 to 15 Wikipedia articles.

2 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                Score  Model Match                                                      Provenance     Sampled
1     Kimi K2-Thinking-0905  0.87   KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking)  Self-reported  2026-05-06
2     DeepSeek-V3            0.73   DeepSeek V3 (deepseek-deepseek-chat)                             Self-reported  2026-05-06
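
For readers who want to work with these results programmatically, the two rows listed above can be held as simple records and ranked by score. This is only a minimal sketch: the `Result` class, its field names, and the `rank` helper are illustrative assumptions, not part of any published FRAMES tooling or of the leaderboard's own schema. The values are copied from the rows above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One leaderboard row (field names are hypothetical)."""
    subject: str
    score: float
    model_id: str
    provenance: str
    sampled: date


# Values copied from the Latest Results rows; deliberately listed out of order.
RESULTS = [
    Result("DeepSeek-V3", 0.73, "deepseek-deepseek-chat",
           "Self-reported", date(2026, 5, 6)),
    Result("Kimi K2-Thinking-0905", 0.87, "moonshotai-kimi-k2-thinking",
           "Self-reported", date(2026, 5, 6)),
]


def rank(results):
    """Return (rank, result) pairs ordered by descending score, ranks from 1."""
    ordered = sorted(results, key=lambda r: r.score, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    for position, r in rank(RESULTS):
        print(f"{position}  {r.subject:<24}{r.score:.2f}  "
              f"{r.provenance}  {r.sampled.isoformat()}")
```

Both scores are marked self-reported, so a record like this is best treated as a snapshot of what was listed on the sampling date rather than an independently verified measurement.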