Graphwalks parents <128k

A graph reasoning benchmark that evaluates language models' ability to find parent nodes in a graph. This variant covers inputs with a context length under 128k tokens. The task requires understanding graph structure and edge relationships.
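To make the task concrete, here is a minimal sketch of what "finding parent nodes" means, assuming a directed graph given as a list of (source, target) edges. The edge-list format, the node names, and the helper `find_parents` are illustrative assumptions; this page does not describe the benchmark's actual prompt format or scoring.

```python
def find_parents(edges, target):
    """Return the sorted nodes that have a directed edge into `target`.

    `edges` is assumed to be an iterable of (source, destination) pairs;
    this representation is hypothetical, not the benchmark's own format.
    """
    parents = {}
    for src, dst in edges:
        # Record src as a parent of dst; a set ignores duplicate edges.
        parents.setdefault(dst, set()).add(src)
    return sorted(parents.get(target, set()))


# Toy graph: a -> b, c -> b, b -> d, a -> d
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]

print(find_parents(edges, "b"))  # ['a', 'c']
print(find_parents(edges, "a"))  # [] (no incoming edges)
```

A model being evaluated has to perform the equivalent lookup by reading the edges from its context window rather than by running code, which is why longer contexts make the task harder.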

Rows: 11
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank | Subject | Score | Model Match (name, ID) | Provenance | Sampled
1 | GPT-5.4 | 0.90 | GPT-5.4, openai-gpt-5.4 | Self-reported | 2026-05-06
2 | GPT-5.2 | 0.89 | GPT-5.2, openai-gpt-5.2 | Self-reported | 2026-05-06
3 | GPT-5 | 0.73 | GPT-5, openai-gpt-5 | Self-reported | 2026-05-06
4 | GPT-4.5 | 0.73 | GPT-4.5, openai-gpt-4.5-preview | Self-reported | 2026-05-06
5 | GPT-5.4 mini | 0.71 | GPT-5.4 Mini, openai-gpt-5.4-mini | Self-reported | 2026-05-06
6 | GPT-4.1 mini | 0.60 | GPT-4.1 Mini, openai-gpt-4.1-mini | Self-reported | 2026-05-06
7 | o3-mini | 0.58 | o3-mini, openai-o3-mini | Self-reported | 2026-05-06
8 | GPT-4.1 | 0.58 | GPT-4.1, openai-gpt-4.1 | Self-reported | 2026-05-06
9 | GPT-5.4 nano | 0.51 | GPT-5.4 Nano, openai-gpt-5.4-nano | Self-reported | 2026-05-06
10 | GPT-4o | 0.35 | GPT-4o (2024-08-06), openai-gpt-4o-2024-08-06 | Self-reported | 2026-05-06
11 | GPT-4.1 nano | 0.09 | GPT-4.1 Nano, openai-gpt-4.1-nano | Self-reported | 2026-05-06