OSUniverse

A benchmark of complex multimodal desktop GUI-navigation tasks for advanced agents, with automated validation and task levels ranging from basic precision clicking to multi-application workflows.

Rows: 8
Primary metric: average
Sampled: 2026-05-27

Metadata

Metrics

Total Score, Paper, Wood, Bronze, Silver, Gold

Latest Results

Rows are parsed from the Agent Performance table on the public OSUniverse project page. The primary score is the reported Total Score across task levels.

Rank | Subject | Total Score | Model Match | Provenance | Sampled
1 | Computer Use Agent with computer-use-preview-2025-03-11 | 47.8% | | Imported | 2026-05-27
2 | Claude Computer Use with claude-3-5-sonnet-20241022 | 28.36% | | Imported | 2026-05-27
3 | AgentDesk-based ReACT with claude-3-5-sonnet-20241022 | 23.44% | | Imported | 2026-05-27
4 | QWEN-based ReACT with qwen2.5-vl-72b-instruct | 18.64% | | Imported | 2026-05-27
5 | AgentDesk-based ReACT with gemini-2.5-pro-exp-03-25 | 9.59% | | Imported | 2026-05-27
6 | AgentDesk-based ReACT with gemini-2.0-flash-001 | 8.26% | | Imported | 2026-05-27
7 | AgentDesk-based ReACT with gemini-1.5-pro-002 | 6.12% | | Imported | 2026-05-27
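
As a rough illustration of how rows like these could be ordered by the primary metric, the sketch below converts the reported Total Score strings to numbers and assigns ranks in descending order. The names `Result`, `parse_percent`, and `rank_results` are hypothetical and are not part of any OSUniverse tooling; the three sample rows are copied from the table above and deliberately supplied out of order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    subject: str
    total_score: float  # Total Score, in percent


def parse_percent(text: str) -> float:
    """Convert a string such as '28.36%' to the float 28.36."""
    return float(text.strip().rstrip("%"))


def rank_results(rows):
    """Return (rank, Result) pairs sorted by Total Score, highest first."""
    results = [Result(subject, parse_percent(score)) for subject, score in rows]
    ordered = sorted(results, key=lambda r: r.total_score, reverse=True)
    return list(enumerate(ordered, start=1))


# Sample rows taken from the table above, deliberately unsorted.
ROWS = [
    ("AgentDesk-based ReACT with gpt-4o-2024-11-20", "6.79%"),
    ("Computer Use Agent with computer-use-preview-2025-03-11", "47.8%"),
    ("Claude Computer Use with claude-3-5-sonnet-20241022", "28.36%"),
]

ranked = rank_results(ROWS)
for position, result in ranked:
    print(f"{position}. {result.subject}: {result.total_score}%")
```

Keeping the score as a float rather than a string avoids lexicographic mis-ordering, where "6.79%" would otherwise sort above "47.8%".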