WorkArena-L1

BrowserGym leaderboard slice for WorkArena-L1, evaluating web agents on atomic ServiceNow knowledge-work tasks.

Rows: 17
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Std. Err. (lower is better)
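
This slice does not say how Std. Err. is computed. As a rough sketch only, assuming the score is a success rate in percent over independent episodes, a binomial standard error could be estimated as below. The function name `binomial_std_err` and the parameter `n_episodes` are hypothetical, and the episode count is not given in this slice.

```python
import math


def binomial_std_err(score_pct: float, n_episodes: int) -> float:
    """Standard error, in percentage points, of a success rate.

    Assumption (not confirmed by the leaderboard): each episode is an
    independent pass/fail trial, so the binomial formula applies.
    """
    if n_episodes <= 0:
        raise ValueError("n_episodes must be positive")
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score_pct must be between 0 and 100")
    p = score_pct / 100.0  # convert percent to a proportion
    return 100.0 * math.sqrt(p * (1.0 - p) / n_episodes)


# Illustrative values only, not taken from the leaderboard.
example_se = binomial_std_err(50.0, 100)
```

Under this assumption the error shrinks with the square root of the episode count, so two scores a few points apart may not be distinguishable when the episode count is small.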

Latest Results

Rows ranked by highest BrowserGym score. Evaluation protocol metadata is preserved from source JSON files.

Rank  Subject                           Score  Model Match  Provenance  Sampled
1     IpaziaHPA-Gemini-3-flash-preview  90.30               Imported    2026-05-06
2     GenericAgent-GPT-5                79.10               Imported    2026-05-06
3     GenericAgent-Claude-4-Sonnet      63.30               Imported    2026-05-06
4     GenericAgent-GPT-5-mini           60.60               Imported    2026-05-06
5     GenericAgent-GPT-o1-mini          56.70               Imported    2026-05-06
6     GenericAgent-Claude-3.5-Sonnet    56.40               Imported    2026-05-06
7     GenericAgent-GPT-o1-mini          51.80               Imported    2026-05-06
8     A3-Qwen3.5-9B                     51.50               Imported    2026-05-06
9     GenericAgent-GPT-oss-120b         50.90               Imported    2026-05-06
10    GenericAgent-o3-mini              48.20               Imported    2026-05-06
11    GenericAgent-GPT-4o               45.50               Imported    2026-05-06
12    GenericAgent-Llama-3.1-405b       43.30               Imported    2026-05-06
13    GenericAgent-GPT-5-nano           40.60               Imported    2026-05-06
14    GenericAgent-GPT-oss-20b          38.50               Imported    2026-05-06
15    GenericAgent-AgentTrek-1.0-32b    38.29               Imported    2026-05-06
16    GenericAgent-Llama-3.1-70b        27.90               Imported    2026-05-06
17    GenericAgent-GPT-4o-mini          27.00               Imported    2026-05-06
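
The ranking rule stated for this table (rows ordered by highest score) can be reproduced with a short sketch like the one below. The `Row` type and `rank_rows` function are hypothetical names for illustration and are not part of BrowserGym or of the source JSON files. How the leaderboard breaks ties is not stated, so this sketch simply keeps input order for equal scores.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure for illustration)."""
    subject: str
    score: float


def rank_rows(rows: List[Row]) -> List[Tuple[int, str, float]]:
    """Return (rank, subject, score) tuples, highest score first.

    sorted() is stable, so rows with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return [(i, r.subject, r.score) for i, r in enumerate(ordered, start=1)]


# Three rows copied from the table, deliberately given out of order.
sample = [
    Row("GenericAgent-GPT-4o-mini", 27.0),
    Row("IpaziaHPA-Gemini-3-flash-preview", 90.30),
    Row("GenericAgent-GPT-5", 79.10),
]
ranked = rank_rows(sample)

for rank, subject, score in ranked:
    print(f"{rank:<4}  {subject:<32}  {score:5.2f}")
```

Because two rows in the table share the subject name GenericAgent-GPT-o1-mini, the sketch keeps rows in a list rather than in a dictionary keyed by subject, which would silently drop one of them.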