URIAL Bench

URIAL Bench evaluates base language models prompted with URIAL (Untuned LLMs with Restyled In-context ALignment) on MT-Bench-style multi-turn tasks.

Rows: 19
Primary metric: overall
Sampled: 2026-05-06

Metadata

Metrics

Overall, Turn 1, Turn 2, Coding, Extraction, Humanities, Math, Reasoning, Roleplay, STEM, Writing

Latest Results

This snapshot mirrors the public URIAL Bench static JSONL file. Each row is a model variant because the benchmark evaluates models under the URIAL prompting method.

Rank  Subject            Overall  Model Match                           Provenance  Sampled
1     gpt-4              8.99     GPT-4 (openai-gpt-4)                  Imported    2026-05-06
2     gpt-3.5-turbo      7.94     GPT-3.5 Turbo (openai-gpt-3.5-turbo)  Imported    2026-05-06
3     dbrx               7.22     -                                     Imported    2026-05-06
4     Llama-2-70b-hf     7.11     -                                     Imported    2026-05-06
5     Mixtral-8x7B-v0.1  6.94     -                                     Imported    2026-05-06
6     Mistral-7b-v0.1    6.67     -                                     Imported    2026-05-06
7     Yi-34B             6.67     -                                     Imported    2026-05-06
8     phi-2-vllm         6.06     -                                     Imported    2026-05-06
9     gemma-7b           6.00     -                                     Imported    2026-05-06
10    phi-2              5.85     -                                     Imported    2026-05-06
11    Llama-2-13b-hf     5.34     -                                     Imported    2026-05-06
12    Yi-6B              4.97     -                                     Imported    2026-05-06
13    Llama-2-7b-hf      4.83     -                                     Imported    2026-05-06
14    gemma-2b           3.97     -                                     Imported    2026-05-06
15    olmo               3.41     -                                     Imported    2026-05-06
16    olmo-7b-vllm       3.38     -                                     Imported    2026-05-06
17    falcon-7b          3.10     -                                     Imported    2026-05-06
18    mpt-7b             1.49     -                                     Imported    2026-05-06
19    amber              1.44     -                                     Imported    2026-05-06
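
The ranking above could be reproduced from a JSONL snapshot roughly as follows. This is a minimal sketch, not the benchmark's own tooling: the field names `model` and `overall` and the helper `rank_rows` are assumptions made for illustration, and the actual record shape of the URIAL Bench JSONL may differ.

```python
import json


def rank_rows(jsonl_text, metric="overall"):
    """Parse JSONL text and return (rank, model, score) tuples, best first.

    The keys "model" and `metric` are hypothetical; adjust them to the
    real field names in the snapshot file.
    """
    # One JSON object per non-empty line.
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # sorted() is stable, so rows with equal scores keep their file order.
    ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
    return [(i, r["model"], r[metric]) for i, r in enumerate(ranked, start=1)]


# Three rows taken from the table above, in the assumed record shape.
sample = "\n".join([
    '{"model": "Llama-2-70b-hf", "overall": 7.11}',
    '{"model": "gpt-4", "overall": 8.99}',
    '{"model": "dbrx", "overall": 7.22}',
])

for rank, model, score in rank_rows(sample):
    print(f"{rank:>2}  {model:<16} {score:.2f}")
```

Passing a different `metric` (for example a per-category key, if the file carries one) would rank by that column instead. Tied scores, such as the two 6.67 rows, would simply appear in the order they occur in the file.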