FRAMES
Factuality, Retrieval, And reasoning MEasurement Set - a unified evaluation dataset of 824 challenging multi-hop questions for testing retrieval-augmented generation systems across factuality, retrieval accuracy, and reasoning capabilities, requiring integration of 2-15 Wikipedia articles per question
2rows
scoreprimary metric
2026-05-06sampled
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2-Thinking-0905 | 0.87 | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Self-reported | 2026-05-06 |
| 2 | DeepSeek-V3 | 0.73 | DeepSeek V3 deepseek-deepseek-chat | Self-reported | 2026-05-06 |
No matching rows.