Online Mind2Web (HAL)

HAL's standardized, cost-aware agent leaderboard for Online Mind2Web web navigation tasks.

Rows: 22
Primary metric: accuracy
Sampled: 2026-05-27

Metadata

Metrics

Accuracy, Cost (USD) (lower is better), Runs

Latest Results

Rows are parsed from the public HAL static leaderboard table. Source scaffold/model display names are preserved; score is the table's Accuracy percentage.

Rank Subject Accuracy Model Match Provenance Sampled
1 SeeAct / GPT-5 Medium (August 2025) 42.33 Verified 2026-05-27
2 Browser-Use / Claude Sonnet 4 (May 2025) 40 Verified 2026-05-27
3 Browser-Use / Claude-3.7 Sonnet High (February 2025) 39.33 Verified 2026-05-27
4 Browser-Use / Claude Sonnet 4 High (May 2025) 39.33 Verified 2026-05-27
5 SeeAct / o3 Medium (April 2025) 39 Verified 2026-05-27
6 Browser-Use / Claude-3.7 Sonnet (February 2025) 38.33 Verified 2026-05-27
7 SeeAct / Claude Sonnet 4 (May 2025) 36.67 Verified 2026-05-27
8 SeeAct / Claude Sonnet 4 High (May 2025) 36.67 Verified 2026-05-27
9 Browser-Use / GPT-4.1 (April 2025) 36.33 Verified 2026-05-27
10 Browser-Use / DeepSeek V3 (March 2025) 32.33 Verified 2026-05-27
11 SeeAct / o4-mini High (April 2025) 32 Verified 2026-05-27
12 Browser-Use / GPT-5 Medium (August 2025) 32 Verified 2026-05-27
13 SeeAct / o4-mini Low (April 2025) 31.67 Verified 2026-05-27
14 SeeAct / GPT-4.1 (April 2025) 30.33 Verified 2026-05-27
15 SeeAct / Claude-3.7 Sonnet High (February 2025) 30.33 Verified 2026-05-27
16 Browser-Use / Gemini 2.0 Flash (February 2025) 29 Verified 2026-05-27
17 Browser-Use / o3 Medium (April 2025) 29 Verified 2026-05-27
18 SeeAct / Claude-3.7 Sonnet (February 2025) 28.33 Verified 2026-05-27
19 SeeAct / Gemini 2.0 Flash (February 2025) 26.67 Verified 2026-05-27
20 Browser-Use / DeepSeek R1 (January 2025) 25.33 Verified 2026-05-27
21 Browser-Use / o4-mini High (April 2025) 20 Verified 2026-05-27
22 Browser-Use / o4-mini Low (April 2025) 18.33 Verified 2026-05-27
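
The rows above can be read back into structured records with a short parser. This is a minimal sketch, not the page's actual ingestion code: the `Row` type, `ROW_RE`, and `parse_row` are hypothetical names, and the layout `<rank> <scaffold> / <model> <accuracy> <status> <date>` is inferred from the rows as they appear here. Because the Model Match column is empty in every row, the sketch keeps the "Verified" token in a neutral `status` field rather than assigning it to a specific column.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    scaffold: str   # agent scaffold, e.g. "SeeAct" or "Browser-Use"
    model: str      # model display name, kept exactly as printed
    accuracy: float # the table's Accuracy percentage
    status: str     # the "Verified" token; its column is an assumption
    sampled: str    # ISO date string, e.g. "2026-05-27"


# Assumed layout: "<rank> <scaffold> / <model> <accuracy> <status> <YYYY-MM-DD>"
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+"
    r"(?P<scaffold>[^/]+?)\s+/\s+"
    r"(?P<model>.+?)\s+"
    r"(?P<accuracy>\d+(?:\.\d+)?)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<sampled>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Optional[Row]:
    """Parse one leaderboard line; return None for headers or other text."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return Row(
        rank=int(m["rank"]),
        scaffold=m["scaffold"],
        model=m["model"],
        accuracy=float(m["accuracy"]),
        status=m["status"],
        sampled=m["sampled"],
    )


if __name__ == "__main__":
    sample = "1 SeeAct / GPT-5 Medium (August 2025) 42.33 Verified 2026-05-27"
    print(parse_row(sample))
```

The model group is matched lazily and the line is anchored on the trailing date, so numbers inside a display name (such as "Gemini 2.0 Flash" or "Claude Sonnet 4") are not mistaken for the accuracy value.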