Realm Warren

A micro1 legal reasoning benchmark set in realistic litigation, transactional, and compliance contexts. It evaluates long-horizon legal work products against IRAC-decomposed rubrics.

Rows: 3
Primary metric: mean_score
Sampled: 2026-05-07

Metadata

Metrics

Mean Weighted Reward, Pass@3, Median Weighted Reward

Latest Results

Mean, median, and confidence intervals are from the model comparison table. Pass@3 scores are from the report header chart.

Rank  Subject          Mean Weighted Reward  Model Match                                             Provenance  Sampled
1     Claude Opus 4.7  0.36                  Claude Opus 4.7 (anthropic-claude-opus-4.7)             Imported    2026-05-07
2     GPT-5.5          0.35                  GPT-5.5 (openai-gpt-5.5)                                Imported    2026-05-07
3     Gemini 3.1 Pro   0.22                  Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-07