OSWorld-MCP

Benchmark for MCP tool invocation in computer-use agents on OSWorld-style desktop tasks.

Rows: 14
Primary metric: accuracy
Sampled: 2026-05-06

Metadata

Metrics

Acc (accuracy), TIR (tool invocation rate), ACS (average completion steps; lower is better)

Latest Results

Model display names are preserved as given by the source. A model that appears more than once is listed once per step-limit variant; the variants are distinguished by subject_id and metadata.step_limit.

Rank  Subject          Acc    Model Match                             Provenance  Sampled
1     Agent-S2.5       49.50  -                                       Imported    2026-05-06
2     Claude 4 Sonnet  45     -                                       Imported    2026-05-06
3     Agent-S2.5       42.10  -                                       Imported    2026-05-06
4     Qwen3-VL         39.50  -                                       Imported    2026-05-06
5     Seed1.5-VL       38.20  -                                       Imported    2026-05-06
6     Claude 4 Sonnet  36.10  -                                       Imported    2026-05-06
7     Qwen3-VL         32.80  -                                       Imported    2026-05-06
8     Seed1.5-VL       30.70  -                                       Imported    2026-05-06
9     Gemini-2.5-Pro   25.70  Gemini 2.5 Pro (google-gemini-2.5-pro)  Imported    2026-05-06
10    OpenAI o3        24.10  o3 (openai-o3)                          Imported    2026-05-06
11    OpenAI o3        17.60  o3 (openai-o3)                          Imported    2026-05-06
12    Gemini-2.5-Pro   17.40  Gemini 2.5 Pro (google-gemini-2.5-pro)  Imported    2026-05-06
13    Qwen2.5-VL       15.60  -                                       Imported    2026-05-06
14    Qwen2.5-VL       14.50  -                                       Imported    2026-05-06
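
Because the same display name can appear in more than one row, anything that consumes these results needs a key that keeps step-limit variants apart. The following is a minimal Python sketch of one way to do that, keying each row by subject_id together with its step limit. The record layout, the example subject_id, the step-limit values, and the scores are all hypothetical placeholders; the page does not show the underlying schema or which step limit produced which score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject_id: str    # hypothetical identifier format
    display_name: str  # display name as given by the source
    step_limit: int    # stands in for metadata.step_limit; values are placeholders
    acc: float         # placeholder score, not taken from the table


def index_rows(rows):
    """Key each row by (subject_id, step_limit) so variants stay distinct."""
    index = {}
    for row in rows:
        key = (row.subject_id, row.step_limit)
        if key in index:
            # Two rows with the same key would be ambiguous, so fail loudly.
            raise ValueError(f"duplicate row for {key}")
        index[key] = row
    return index


def label(row):
    """Build a display label that stays unique across step-limit variants."""
    return f"{row.display_name} (step limit {row.step_limit})"


# Illustrative rows only: one made-up subject with two made-up step limits.
rows = [
    Row("example-agent", "Example Agent", 50, 40.0),
    Row("example-agent", "Example Agent", 15, 30.0),
]
index = index_rows(rows)

for key in sorted(index):
    print(label(index[key]), index[key].acc)
```

Using a composite key rather than the display name alone means a second variant of the same model can never silently overwrite the first, and a true duplicate is reported instead of being dropped.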