VisualWebArena

VisualWebArena measures how well GUI agents complete visually grounded tasks in interactive web environments.

24 rows
Primary metric: success_rate
Sampled: 2026-05-05

Metadata

Metrics

Success Rate
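
As a rough illustration of the metric, success rate can be read as the percentage of evaluated tasks that the agent completed successfully. The sketch below assumes each task yields a single pass/fail outcome; the function name `success_rate` and the example counts are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful, rounded to 2 decimals."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one task result")
    # sum() counts True values because bool is a subclass of int.
    return round(100.0 * sum(outcomes) / len(outcomes), 2)


# Hypothetical run: 18 successes out of 91 tasks (illustrative numbers only).
example_outcomes = [True] * 18 + [False] * 73
example_rate = success_rate(example_outcomes)
print(f"{example_rate:.2f}")
```

The values in the results table are on this same 0 to 100 scale.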

Latest Results

| Rank | Subject | Success Rate (%) | Model Match | Provenance | Sampled |
|------|---------|------------------|-------------|------------|---------|
| 1 | Human Performance | 88.70 | | Imported | 2026-05-05 |
| 2 | GPT-4o | 19.78 | | Imported | 2026-05-05 |
| 3 | GPT-4V | 16.37 | | Imported | 2026-05-05 |
| 4 | GPT-4V | 15.05 | | Imported | 2026-05-05 |
| 5 | GPT-4 | 12.75 | | Imported | 2026-05-05 |
| 6 | Gemini-Pro-1.5 | 11.98 | | Imported | 2026-05-05 |
| 7 | LLaMA-3-70B-Instruct | 9.78 | | Imported | 2026-05-05 |
| 8 | GPT-4 | 7.25 | | Imported | 2026-05-05 |
| 9 | Gemini-Flash-1.5 | 6.59 | | Imported | 2026-05-05 |
| 10 | Gemini-Pro | 6.04 | | Imported | 2026-05-05 |
| 11 | Gemini-Pro | 5.71 | | Imported | 2026-05-05 |
| 12 | Gemini-Pro | 3.85 | | Imported | 2026-05-05 |
| 13 | GPT-3.5 | 2.97 | | Imported | 2026-05-05 |
| 14 | GPT-3.5 | 2.75 | | Imported | 2026-05-05 |
| 15 | Gemini-Pro | 2.20 | | Imported | 2026-05-05 |
| 16 | GPT-3.5 | 2.20 | | Imported | 2026-05-05 |
| 17 | Mixtral-8x7B | 1.87 | | Imported | 2026-05-05 |
| 18 | Mixtral-8x7B | 1.76 | | Imported | 2026-05-05 |
| 19 | Text-only + LLaMA-2-70B | 1.10 | | Imported | 2026-05-05 |
| 20 | Multimodal (SoM) + IDEFICS-80B-Instruct | 0.99 | | Imported | 2026-05-05 |
| 21 | Multimodal + IDEFICS-80B-Instruct | 0.77 | | Imported | 2026-05-05 |
| 22 | Caption-augmented + LLaMA-2-70B + BLIP-2-T5XL | 0.66 | | Imported | 2026-05-05 |
| 23 | CogVLM | 0.33 | | Imported | 2026-05-05 |
| 24 | CogVLM | 0.33 | | Imported | 2026-05-05 |