ANLI

Adversarial Natural Language Inference benchmark for challenging entailment reasoning.

9 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score
Standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

Rank  Subject                     Score  Model Match                                           Provenance  Sampled
1     Phi-3-small-8k-instruct     58.10  -                                                     Imported    2026-05-06
2     gpt-3.5-turbo-1106          58.10  GPT-3.5 Turbo (openai-gpt-3.5-turbo)                  Imported    2026-05-06
3     Meta-Llama-3-8B-Instruct    57.30  Llama 3 8B Instruct (meta-llama-llama-3-8b-instruct)  Imported    2026-05-06
4     Phi-3-medium-128k-instruct  55.80  -                                                     Imported    2026-05-06
5     Mixtral-8x7B-v0.1           55.20  -                                                     Imported    2026-05-06
6     Phi-3-mini-4k-instruct      52.80  -                                                     Imported    2026-05-06
7     gemma-7b                    48.70  -                                                     Imported    2026-05-06
8     Mistral-7B-v0.1             47.10  -                                                     Imported    2026-05-06
9     Phi-4                       42.50  Phi 4 (microsoft-phi-4)                               Imported    2026-05-06