VisualWebBench

A multimodal benchmark that assesses multimodal large language models (MLLMs) on web page understanding and grounding. It comprises seven tasks (captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding) with 1.5K human-curated instances from 139 real websites across 87 sub-domains.

Rows: 2
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject    Score  Model                            Match  Provenance     Sampled
1     Nova Pro   0.80   Nova Pro (amazon-nova-pro-v1)    1.0    Self-reported  2026-05-06
2     Nova Lite  0.78   Nova Lite (amazon-nova-lite-v1)  1.0    Self-reported  2026-05-06
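
For readers who want to work with these results programmatically, the two rows above can be held as simple records and ranked by the primary metric. This is only a minimal sketch: the `Result` class, its field names, and the `rank` helper are hypothetical names introduced here for illustration and are not part of any published leaderboard API. The values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical record layout, not an official schema)."""
    subject: str
    model_id: str
    score: float
    provenance: str
    sampled: str  # ISO date string, as shown in the table


# Values copied from the Latest Results table; deliberately listed out of order.
RESULTS = [
    Result("Nova Lite", "amazon-nova-lite-v1", 0.78, "Self-reported", "2026-05-06"),
    Result("Nova Pro", "amazon-nova-pro-v1", 0.80, "Self-reported", "2026-05-06"),
]


def rank(results):
    """Return (rank, result) pairs ordered by score, highest first."""
    ordered = sorted(results, key=lambda r: r.score, reverse=True)
    return list(enumerate(ordered, start=1))


for position, r in rank(RESULTS):
    print(f"{position}  {r.subject:<10} {r.score:.2f}  {r.provenance}  {r.sampled}")
```

Both scores are marked as self-reported, so a ranking built this way reflects the vendors' published numbers rather than an independent re-run.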