Graphwalks BFS <128k

A graph-reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) on graphs whose prompts fit in under 128k tokens of context. The model must return the nodes reachable at a specified depth.
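As a rough illustration of the operation being tested, the sketch below runs a level-by-level BFS over a directed edge list and returns the nodes first reached at a given depth. The function name `nodes_at_depth`, the edge-list input format, and the reading of "reachable at depth d" as "shortest distance exactly d" are all assumptions made for this example; the benchmark's actual prompt format and answer definition are not stated here.

```python
def nodes_at_depth(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is exactly `depth`.

    Assumption: edges are directed (src, dst) pairs; this is a hypothetical
    helper, not the benchmark's reference implementation.
    """
    # Build an adjacency list from the edge list.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    visited = {start}
    frontier = [start]
    # Expand one BFS level per iteration.
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return set(frontier)


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(nodes_at_depth(edges, "a", 2)))  # ['d']
```

The `visited` set is what keeps a node such as `d`, which has two parents at depth 1, from being counted twice or reappearing at a later depth.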

11 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject       Score  Model Match          Model ID                  Provenance     Sampled
1     GPT-5.2       0.94   GPT-5.2              openai-gpt-5.2            Self-reported  2026-05-06
2     GPT-5.4       0.93   GPT-5.4              openai-gpt-5.4            Self-reported  2026-05-06
3     GPT-5         0.78   GPT-5                openai-gpt-5              Self-reported  2026-05-06
4     GPT-5.4 mini  0.76   GPT-5.4 Mini         openai-gpt-5.4-mini       Self-reported  2026-05-06
5     GPT-5.4 nano  0.73   GPT-5.4 Nano         openai-gpt-5.4-nano       Self-reported  2026-05-06
6     GPT-4.5       0.72   GPT-4.5              openai-gpt-4.5-preview    Self-reported  2026-05-06
7     GPT-4.1       0.62   GPT-4.1              openai-gpt-4.1            Self-reported  2026-05-06
7     GPT-4.1 mini  0.62   GPT-4.1 Mini         openai-gpt-4.1-mini       Self-reported  2026-05-06
9     o3-mini       0.51   o3-mini              openai-o3-mini            Self-reported  2026-05-06
10    GPT-4o        0.42   GPT-4o (2024-08-06)  openai-gpt-4o-2024-08-06  Self-reported  2026-05-06
11    GPT-4.1 nano  0.25   GPT-4.1 Nano         openai-gpt-4.1-nano       Self-reported  2026-05-06
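
The rank column appears to use standard competition ranking: GPT-4.1 and GPT-4.1 mini tie at 0.62 and both get rank 7, and the next entry gets rank 9. That rule is inferred from the listed ranks rather than stated by the page. A minimal sketch, with `competition_ranks` as a hypothetical helper name:

```python
def competition_ranks(scores):
    """Rank scores in descending order; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores, reverse=True)
    # The first position of a score in the sorted list, plus one, is its rank.
    return [ordered.index(score) + 1 for score in scores]


# Scores copied from the results listed above, in listed order.
scores = [0.94, 0.93, 0.78, 0.76, 0.73, 0.72, 0.62, 0.62, 0.51, 0.42, 0.25]
print(competition_ranks(scores))  # [1, 2, 3, 4, 5, 6, 7, 7, 9, 10, 11]
```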