MiniWoB++

BrowserGym leaderboard slice for MiniWoB++, evaluating agents on small browser interaction tasks.

16 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Std. Err. (lower is better)

Latest Results

Rows ranked by highest BrowserGym score. Evaluation protocol metadata is preserved from source JSON files.

Rank  Subject                         Score  Model Match  Provenance  Sampled
1     OrbyAgent-Claude-3.5-Sonnet     74.90  -            Imported    2026-05-06
2     GenericAgent-GPT-5              71.50  -            Imported    2026-05-06
3     GenericAgent-GPT-5-mini         71.00  -            Imported    2026-05-06
4     GenericAgent-Claude-4-Sonnet    70.70  -            Imported    2026-05-06
5     GenericAgent-Claude-3.5-Sonnet  69.80  -            Imported    2026-05-06
6     A3-Qwen3.5-9B                   69.00  -            Imported    2026-05-06
7     GenericAgent-GPT-o1-mini        67.80  -            Imported    2026-05-06
8     GenericAgent-GPT-oss-120b       66.40  -            Imported    2026-05-06
9     GenericAgent-GPT-5-nano         64.80  -            Imported    2026-05-06
10    GenericAgent-Llama-3.1-405b     64.60  -            Imported    2026-05-06
11    OrbyAgent-ActIO-72b             64.20  -            Imported    2026-05-06
12    GenericAgent-GPT-oss-20b        64.00  -            Imported    2026-05-06
13    GenericAgent-GPT-4o             63.80  -            Imported    2026-05-06
14    GenericAgent-AgentTrek-1.0-32b  60.00  -            Imported    2026-05-06
15    GenericAgent-Llama-3.1-70b      57.60  -            Imported    2026-05-06
16    GenericAgent-GPT-4o-mini        56.60  -            Imported    2026-05-06
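
The ranking rule stated for this slice (rows ordered by highest score) can be sketched as follows. This is a minimal illustration, not the leaderboard's actual pipeline: the record layout and the field names `subject` and `score` are assumptions, since the schema of the source JSON files is not shown here. The four sample rows reuse scores from the results listed for this slice.

```python
# Hypothetical row records. The field names "subject" and "score" are
# assumptions for illustration, not the schema of the source JSON files.
rows = [
    {"subject": "GenericAgent-GPT-5", "score": 71.50},
    {"subject": "OrbyAgent-Claude-3.5-Sonnet", "score": 74.90},
    {"subject": "GenericAgent-GPT-5-mini", "score": 71.00},
    {"subject": "GenericAgent-GPT-4o-mini", "score": 56.60},
]


def rank_rows(rows):
    """Return copies of the rows sorted by score, highest first, with a 1-based rank."""
    ordered = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [{"rank": i, **row} for i, row in enumerate(ordered, start=1)]


ranked = rank_rows(rows)
for row in ranked:
    # Scores are printed with two decimals so that 71 and 71.00 render alike.
    print(f'{row["rank"]:>2}  {row["subject"]:<32} {row["score"]:.2f}')
```

Python's `sorted` is stable, so rows with equal scores would keep their input order; whether the leaderboard breaks ties some other way is not stated.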