LingOly-TOO

Linguistics reasoning benchmark evaluating models on baseline and obfuscated questions to separate reasoning ability from memorization.
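
One way to read the two metrics together is to compare a model's baseline score with its obfuscated score. The following is a minimal sketch, not the benchmark's own scoring code: the function names `score_gap` and `relative_drop` and the example values are hypothetical, and treating a larger drop as a sign of heavier reliance on memorized material is an assumption made here for illustration.

```python
def score_gap(baseline: float, obfuscated: float) -> float:
    """Absolute drop from the baseline score to the obfuscated score."""
    return baseline - obfuscated


def relative_drop(baseline: float, obfuscated: float) -> float:
    """Drop expressed as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - obfuscated) / baseline


# Hypothetical scores for illustration only; they are not taken from this page.
example_baseline = 0.60
example_obfuscated = 0.45

print(round(score_gap(example_baseline, example_obfuscated), 2))
print(round(relative_drop(example_baseline, example_obfuscated), 2))
```

Only obfuscated scores are listed in the results on this page, so the comparison would need baseline scores from the underlying data.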

Rows: 16
Primary metric: obfuscated_score
Sampled: 2026-05-06

Metadata

Metrics

Obfuscated Score, Baseline Score

Latest Results

Rank | Subject | Obfuscated Score | Model Match | Provenance | Sampled
1 | GPT-5 | 0.47 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06
2 | Claude Opus 4.1 | 0.46 | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-06
3 | Claude 3.7 Sonnet | 0.43 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06
4 | Gemini 2.5 Pro | 0.42 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06
5 | DeepSeek-V3.1-Terminus | 0.42 | DeepSeek V3.1 Terminus (deepseek-deepseek-v3.1-terminus) | Imported | 2026-05-06
6 | o1-preview | 0.32 | o1-preview (openai-o1-preview) | Imported | 2026-05-06
7 | o3-mini (high) | 0.31 | o3 Mini High (openai-o3-mini-high) | Imported | 2026-05-06
8 | Claude 3.5 Sonnet | 0.28 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-06
9 | DeepSeek R1 | 0.26 | R1 (deepseek-r1) | Imported | 2026-05-06
10 | GPT 4.5 | 0.25 | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-06
11 | Gemini 1.5 Pro | 0.20 | - | Imported | 2026-05-06
12 | GPT 4o | 0.16 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06
13 | o3-mini (low) | 0.12 | o3-mini (openai-o3-mini) | Imported | 2026-05-06
14 | Phi4 | 0.11 | Phi 4 (microsoft-phi-4) | Imported | 2026-05-06
15 | Llama 3.3 70B-Instruct | 0.08 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-06
16 | Aya 23 35B | 0.06 | - | Imported | 2026-05-06