OSWorld-MCP
Benchmark for MCP tool invocation in computer-use agents on OSWorld-style desktop tasks.
14 rows
Primary metric: accuracy
Sampled: 2026-05-06
Metadata
Metrics: Acc, TIR, ACS (for ACS, lower is better)
| Rank | Subject | Acc | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Agent-S2.5 | 49.50 | — | Imported | 2026-05-06 |
| 2 | Claude 4 Sonnet | 45.00 | — | Imported | 2026-05-06 |
| 3 | Agent-S2.5 | 42.10 | — | Imported | 2026-05-06 |
| 4 | Qwen3-VL | 39.50 | — | Imported | 2026-05-06 |
| 5 | Seed1.5-VL | 38.20 | — | Imported | 2026-05-06 |
| 6 | Claude 4 Sonnet | 36.10 | — | Imported | 2026-05-06 |
| 7 | Qwen3-VL | 32.80 | — | Imported | 2026-05-06 |
| 8 | Seed1.5-VL | 30.70 | — | Imported | 2026-05-06 |
| 9 | Gemini-2.5-Pro | 25.70 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 10 | OpenAI o3 | 24.10 | o3 (`openai-o3`) | Imported | 2026-05-06 |
| 11 | OpenAI o3 | 17.60 | o3 (`openai-o3`) | Imported | 2026-05-06 |
| 12 | Gemini-2.5-Pro | 17.40 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 13 | Qwen2.5-VL | 15.60 | — | Imported | 2026-05-06 |
| 14 | Qwen2.5-VL | 14.50 | — | Imported | 2026-05-06 |