OSWorld-Verified
OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.
23 rows
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Score, Normalized Score
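Showing the 4 latest source slices. Rank restarts within each slice; slices are distinguished by the Sampled date. The 2026-05-06 slice reports scores as fractions (0 to 1), while the other slices report percentages.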
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 83.4% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 | 82.8% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 3 | GPT-5.5 | 78.7% | GPT-5.5 openai-gpt-5.5 | Self-reported | 2026-05-28 |
| 4 | Gemini 3.1 Pro Preview | 76.2% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Self-reported | 2026-05-28 |
| 1 | Claude Mythos Preview | 0.80 | Claude Mythos Preview anthropic-claude-mythos-preview | Self-reported | 2026-05-06 |
| 2 | GPT-5.5 | 0.79 | GPT-5.5 openai-gpt-5.5 | Self-reported | 2026-05-06 |
| 3 | Claude Opus 4.7 | 0.78 | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-06 |
| 4 | GPT-5.4 | 0.75 | GPT-5.4 openai-gpt-5.4 | Self-reported | 2026-05-06 |
| 5 | Kimi K2.6 | 0.73 | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-06 |
| 6 | GPT-5.4 mini | 0.72 | GPT-5.4 Mini openai-gpt-5.4-mini | Self-reported | 2026-05-06 |
| 7 | GPT-5.3 Codex | 0.65 | GPT-5.3-Codex openai-gpt-5.3-codex | Self-reported | 2026-05-06 |
| 8 | Qwen3.6 Plus | 0.63 | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-122B-A10B | 0.58 | Qwen3.5-122B-A10B qwen-qwen3.5-122b-a10b | Self-reported | 2026-05-06 |
| 10 | Qwen3.5-27B | 0.56 | Qwen3.5-27B qwen-qwen3.5-27b | Self-reported | 2026-05-06 |
| 11 | Qwen3.5-35B-A3B | 0.55 | Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b | Self-reported | 2026-05-06 |
| 12 | GPT-5.4 nano | 0.39 | GPT-5.4 Nano openai-gpt-5.4-nano | Self-reported | 2026-05-06 |
| 1 | GPT-5.5 | 78.7% | GPT-5.5 openai-gpt-5.5 | Launch post | 2026-04-23 |
| 2 | Claude Opus 4.7 | 78% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Launch post | 2026-04-23 |
| 3 | GPT-5.4 | 75% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-23 |
| 1 | Claude Mythos Preview | 79.6% | Claude Mythos Preview anthropic-claude-mythos-preview | Launch post | 2026-04-16 |
| 2 | Claude Opus 4.7 | 78% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Launch post | 2026-04-16 |
| 3 | GPT-5.4 | 75% | GPT-5.4 openai-gpt-5.4 | Launch post | 2026-04-16 |
| 4 | Claude Opus 4.6 | 72.7% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Launch post | 2026-04-16 |
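Because the slices mix percentage scores (e.g. 83.4%) with fractional scores (e.g. 0.80), comparing rows across slices requires putting them on one scale first. The following is a minimal sketch, assuming a percentage cell simply maps to its value divided by 100; the page does not define how its "Normalized Score" metric is computed, so this is not necessarily that metric. The helper name `to_fraction` and the `rows` sample are illustrative only.

```python
def to_fraction(score_cell: str) -> float:
    """Convert a score cell such as '83.4%' or '0.80' to a fraction in [0, 1].

    Assumption: cells ending in '%' are percentages; all other cells
    are already fractions.
    """
    text = score_cell.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


# A few rows copied from the table above: (subject, score cell, sampled date).
rows = [
    ("Claude Opus 4.8", "83.4%", "2026-05-28"),
    ("Claude Mythos Preview", "0.80", "2026-05-06"),
    ("GPT-5.5", "78.7%", "2026-04-23"),
    ("Claude Opus 4.6", "72.7%", "2026-04-16"),
]

# Put every score on the same 0-1 scale, rounded to three decimals.
common_scale = [
    (subject, round(to_fraction(cell), 3), sampled)
    for subject, cell, sampled in rows
]

for subject, score, sampled in common_scale:
    print(f"{sampled}  {subject:<24} {score:.3f}")
```

Even on a common scale, the slices come from different sources and dates, so cross-slice comparisons should be read with care.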