RULER

RULER v1 is a synthetic long-context benchmark that measures how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation, which defines 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and question answering (QA).
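
To make the idea of "quality versus input length" concrete, here is a minimal sketch of a needle-in-a-haystack style retrieval probe scored by substring match. It is an illustration only, not the NVIDIA RULER code: the names `build_niah_prompt`, `string_match_score`, `toy_model`, and `evaluate` are hypothetical, the filler text and prompt wording are assumptions, and the toy model simply reads a fixed-size prefix of the prompt so that its score can drop as the context grows.

```python
import random
import re


def build_niah_prompt(num_filler_sentences, needle_key, needle_value, seed=0):
    """Hide one key/value 'needle' sentence at a random position among filler text.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow."
    sentences = [filler] * num_filler_sentences
    needle = f"The special magic number for {needle_key} is {needle_value}."
    sentences.insert(rng.randint(0, num_filler_sentences), needle)
    question = f"What is the special magic number for {needle_key}?"
    return " ".join(sentences) + "\n\n" + question


def string_match_score(prediction, references):
    """Fraction of reference strings found (case-insensitively) in the prediction."""
    if not references:
        return 0.0
    lowered = prediction.lower()
    hits = sum(ref.lower() in lowered for ref in references)
    return hits / len(references)


def toy_model(prompt, window_chars=500):
    """Stand-in 'model' that only sees the first window_chars characters."""
    visible = prompt[:window_chars]
    match = re.search(r"magic number for \w+ is (\d+)", visible)
    return match.group(1) if match else "unknown"


def evaluate(model, lengths, needle_key="ruler", needle_value="7481"):
    """Score the model once per context length (in filler sentences)."""
    return {
        n: string_match_score(
            model(build_niah_prompt(n, needle_key, needle_value)), [needle_value]
        )
        for n in lengths
    }


if __name__ == "__main__":
    for length, score in evaluate(toy_model, [5, 50, 500]).items():
        print(f"{length:>4} filler sentences -> score {score:.2f}")
```

A real run would replace `toy_model` with a call to the model under test and average over many sampled needles and positions per length, which this sketch does not do.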

Rows: 3
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                       Score  Model Match                                           Provenance     Sampled
1     Nemotron 3 Super (120B A12B)  0.92   Nemotron 3 Super (nvidia-nemotron-3-super-120b-a12b)  Self-reported  2026-05-06
2     Phi-3.5-MoE-instruct          0.87   -                                                     Self-reported  2026-05-06
3     Phi-3.5-mini-instruct         0.84   -                                                     Self-reported  2026-05-06