WorkArena-L3

This BrowserGym leaderboard slice covers WorkArena-L3, which evaluates web agents on harder, compositional ServiceNow knowledge-work tasks.

Rows: 8
Primary metric: score
Sampled: 2026-05-06

Metrics

Score (higher is better), Std. Err. (lower is better)

Latest Results

Rows ranked by highest BrowserGym score. Evaluation protocol metadata is preserved from source JSON files.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | GenericAgent-GPT-5 | 11.50 | | Imported | 2026-05-06 |
| 2 | GenericAgent-Claude-3.5-Sonnet | 0.40 | | Imported | 2026-05-06 |
| 3 | GenericAgent-AgentTrek-1.0-32b | 0 | | Imported | 2026-05-06 |
| 4 | GenericAgent-GPT-4o-mini | 0 | | Imported | 2026-05-06 |
| 5 | GenericAgent-GPT-4o | 0 | | Imported | 2026-05-06 |
| 6 | GenericAgent-GPT-o1-mini | 0 | | Imported | 2026-05-06 |
| 7 | GenericAgent-Llama-3.1-405b | 0 | | Imported | 2026-05-06 |
| 8 | GenericAgent-Llama-3.1-70b | 0 | | Imported | 2026-05-06 |
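
The ranking in the table above can be reproduced with a short sketch. It sorts rows by score, highest first. The source does not say how tied scores are ordered, so this sketch assumes tied rows simply keep their input order (Python's `sorted` is stable). The helper name `rank_rows` and the `(subject, score)` tuple layout are hypothetical and not part of BrowserGym.

```python
# Rows copied from the table above as (subject, score), in table order.
rows = [
    ("GenericAgent-GPT-5", 11.50),
    ("GenericAgent-Claude-3.5-Sonnet", 0.40),
    ("GenericAgent-AgentTrek-1.0-32b", 0.0),
    ("GenericAgent-GPT-4o-mini", 0.0),
    ("GenericAgent-GPT-4o", 0.0),
    ("GenericAgent-GPT-o1-mini", 0.0),
    ("GenericAgent-Llama-3.1-405b", 0.0),
    ("GenericAgent-Llama-3.1-70b", 0.0),
]


def rank_rows(rows):
    """Return (rank, subject, score) tuples ordered by score, highest first.

    Assumption: ties keep their input order, because the leaderboard's
    actual tie-break rule is not stated in the source.
    """
    # sorted() is stable even with reverse=True, so equal scores are not reordered.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, subject, score)
            for rank, (subject, score) in enumerate(ordered, start=1)]


ranked = rank_rows(rows)
for rank, subject, score in ranked:
    print(rank, subject, score)
```

Since six of the eight rows share a score of 0, the positions 3 through 8 depend entirely on the tie-break assumption and should not be read as a meaningful ordering.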