OSUniverse
Benchmark of complex multimodal desktop GUI-navigation tasks for advanced agents, with automated validation and task levels ranging from basic precision clicking to multi-application workflows.
Rows: 8
Primary metric: average
Sampled: 2026-05-27
Metadata
Metrics: Total Score, Paper, Wood, Bronze, Silver, Gold
| Rank | Subject | Total Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Computer Use Agent with computer-use-preview-2025-03-11 | 47.8% | — | Imported | 2026-05-27 |
| 2 | Claude Computer Use with claude-3-5-sonnet-20241022 | 28.36% | — | Imported | 2026-05-27 |
| 3 | AgentDesk-based ReACT with claude-3-5-sonnet-20241022 | 23.44% | — | Imported | 2026-05-27 |
| 4 | QWEN-based ReACT with qwen2.5-vl-72b-instruct | 18.64% | — | Imported | 2026-05-27 |
| 5 | AgentDesk-based ReACT with gemini-2.5-pro-exp-03-25 | 9.59% | — | Imported | 2026-05-27 |
| 6 | AgentDesk-based ReACT with gemini-2.0-flash-001 | 8.26% | — | Imported | 2026-05-27 |
| 7 | AgentDesk-based ReACT with gpt-4o-2024-11-20 | 6.79% | — | Imported | 2026-05-27 |
| 8 | AgentDesk-based ReACT with gemini-1.5-pro-002 | 6.12% | — | Imported | 2026-05-27 |