WorkArena-L2

BrowserGym leaderboard slice for WorkArena-L2, evaluating web agents on compositional ServiceNow knowledge-work tasks.

14 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Std. Err. (lower is better)

Latest Results

Rows ranked by highest BrowserGym score. Evaluation protocol metadata is preserved from source JSON files.

Rank  Subject                         Score  Model Match  Provenance  Sampled
1     GenericAgent-GPT-5              69.40  -            Imported    2026-05-06
2     GenericAgent-GPT-5-mini         47.70  -            Imported    2026-05-06
3     GenericAgent-Claude-4-Sonnet    40.40  -            Imported    2026-05-06
4     GenericAgent-Claude-3.5-Sonnet  39.10  -            Imported    2026-05-06
5     GenericAgent-GPT-o1-mini        14.90  -            Imported    2026-05-06
6     GenericAgent-GPT-oss-120b       11.50  -            Imported    2026-05-06
7     A3-Qwen3.5-9B                   10.60  -            Imported    2026-05-06
8     GenericAgent-GPT-4o              8.50  -            Imported    2026-05-06
9     GenericAgent-Llama-3.1-405b      7.20  -            Imported    2026-05-06
10    GenericAgent-GPT-5-nano          3.40  -            Imported    2026-05-06
11    GenericAgent-AgentTrek-1.0-32b   2.98  -            Imported    2026-05-06
12    GenericAgent-GPT-oss-20b         2.60  -            Imported    2026-05-06
13    GenericAgent-Llama-3.1-70b       2.10  -            Imported    2026-05-06
14    GenericAgent-GPT-4o-mini         1.30  -            Imported    2026-05-06
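
The ranking rule stated above (rows ordered by highest score) can be sketched in a few lines of Python. This is only an illustration: the `Row` type, its field names, and the `rank_rows` helper are hypothetical and are not taken from BrowserGym or from the source JSON files. The three sample rows reuse scores from the table.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    """Hypothetical record for one leaderboard entry (field names are assumptions)."""
    subject: str
    score: float


def rank_rows(rows: List[Row]) -> List[Tuple[int, str, float]]:
    """Order rows by descending score and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return [(rank, r.subject, r.score) for rank, r in enumerate(ordered, start=1)]


# Three rows from the table, deliberately given out of order.
rows = [
    Row("GenericAgent-GPT-4o", 8.50),
    Row("GenericAgent-GPT-5", 69.40),
    Row("A3-Qwen3.5-9B", 10.60),
]

ranked = rank_rows(rows)
for rank, subject, score in ranked:
    print(f"{rank:<4}  {subject:<30}  {score:5.2f}")
```

Python's `sorted` is stable, so rows with equal scores would keep their input order; the page does not say how ties are broken on the actual leaderboard.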