LongDocURL

Long-document understanding benchmark for locating and reasoning over evidence across very large document collections.

Rows: 7
Primary metric: Total
Sampled: 2026-05-27

Metadata

Metrics

Understanding, Reasoning, Locating, Text, Layout, Figure, Table, Single Page, Multi Page, Cross Element, Total

Latest Results

Rows parsed from the public LongDocURL leaderboard. Scores are displayed by task type, evidence element, page/element grouping, and total.
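
The grouping described above can be sketched as a small lookup table. This is only an illustration: the metric names come from the Metrics list, but the group keys (`task_type`, `evidence_element`, `page_element_grouping`, `total`) and the helper `group_of` are hypothetical labels chosen here, not identifiers used by the leaderboard.

```python
# Hypothetical grouping of the LongDocURL metric columns.
# Metric names are taken from the Metrics list; the group keys are
# illustrative assumptions, not names used by the leaderboard itself.
METRIC_GROUPS = {
    "task_type": ["Understanding", "Reasoning", "Locating"],
    "evidence_element": ["Text", "Layout", "Figure", "Table"],
    "page_element_grouping": ["Single Page", "Multi Page", "Cross Element"],
    "total": ["Total"],
}


def group_of(metric: str) -> str:
    """Return the (hypothetical) group key that a metric column belongs to."""
    for group, metrics in METRIC_GROUPS.items():
        if metric in metrics:
            return group
    raise KeyError(f"unknown metric: {metric}")


# Flattened column order, matching the order of the Metrics list.
ALL_METRICS = [m for metrics in METRIC_GROUPS.values() for m in metrics]
```

For example, `group_of("Layout")` returns `"evidence_element"`, and `ALL_METRICS` holds the eleven metric columns in the order listed under Metrics.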

| Rank | Subject | Total | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | GPT-4o-24-05-13 | 64.5% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 2 | Gemini-1.5-Pro | 50.9% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 3 | Qwen-VL-Max | 49.5% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 4 | Qwen2-VL | 30.6% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 5 | LLaVA-OneVision-Chat | 25% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 6 | LLaVA-Next-Interleave-DPO | 16.2% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 7 | Llama-3.2 | 9.2% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
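
As a minimal sketch of how the Rank column relates to the primary metric, the snippet below sorts the seven rows by Total in descending order. The `Row` class and `rank_by_total` function are hypothetical names introduced for illustration, and subject labels are shortened to the model name. The Total values are copied from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    subject: str
    total: float  # Total score in percent


# Totals as listed in the results above, deliberately given out of order.
ROWS = [
    Row("Qwen2-VL", 30.6),
    Row("GPT-4o-24-05-13", 64.5),
    Row("Llama-3.2", 9.2),
    Row("Gemini-1.5-Pro", 50.9),
    Row("LLaVA-Next-Interleave-DPO", 16.2),
    Row("Qwen-VL-Max", 49.5),
    Row("LLaVA-OneVision-Chat", 25.0),
]


def rank_by_total(rows):
    """Return (rank, subject, total) tuples, highest Total first."""
    ordered = sorted(rows, key=lambda r: r.total, reverse=True)
    return [(i, r.subject, r.total) for i, r in enumerate(ordered, start=1)]


ranked = rank_by_total(ROWS)
```

Sorting on the single Total value reproduces the rank order shown in the results. A fuller version would presumably carry the per-metric scores as well, which are not included in the rows here.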