FlenQA
Flexible Length Question Answering dataset for evaluating how input length affects the reasoning performance of language models. It features True/False questions embedded in contexts of varying lengths (250-3000 tokens) across three reasoning tasks: Monotone Relations, People In Rooms, and simplified Ruletaker.
2 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Phi 4 Reasoning Plus | 0.98 | — | Self-reported | 2026-05-06 |
| 2 | Phi 4 Reasoning | 0.98 | — | Self-reported | 2026-05-06 |