AndroidWorld
AndroidWorld: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
43 rows
Primary metric: pass_at_1_success_rate
Sampled: 2026-05-27
Metadata
Metrics
Success Rate (pass@1), Success Rate (pass@k), Number of trials
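The page lists the three metrics without saying how they relate. As a minimal sketch, assuming pass@k is estimated from the number of trials per task with the commonly used unbiased estimator (this page does not confirm which estimator the imported results use), the relationship could look like this. The names `pass_at_k` and `benchmark_pass_at_k` and the per-task result format are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n trials, c of which succeeded.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    For k = 1 this reduces to the plain success fraction c / n.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task pass@k over all tasks.

    `results` maps a (hypothetical) task name to its list of trial outcomes.
    """
    scores = [pass_at_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)
```

With a single trial per task, `benchmark_pass_at_k(results, 1)` is simply the fraction of tasks solved, which is how a pass@1 success rate is usually read.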
| Rank | Subject | Success Rate (pass@1) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | AGI-0 | 97.4% | — | Imported | 2026-05-27 |
| 2 | Gemini 3 flash, Gemini 3 flash lite | 97.4% | — | Imported | 2026-05-27 |
| 3 | Seed1.8-GUI | 97.4% | — | Imported | 2026-05-27 |
| 4 | askui AndroidVisionAgent, Claude 4.5 Sonnet + Claude 4.0 Sonnet | 94.8% | — | Imported | 2026-05-27 |
| 5 | gemini 3 pro + sonnet 4.5 | 94.8% | — | Imported | 2026-05-27 |
| 6 | GPT5, Gemini 2.5 Pro | 91.4% | — | Imported | 2026-05-27 |
| 7 | Llama 4-scout, Gemini 2.5 pro, GPT-5 nano | 91.4% | — | Imported | 2026-05-27 |
| 8 | - | 88.8% | — | Imported | 2026-05-27 |
| 9 | o3 + holo1.5-72b | 87.1% | — | Imported | 2026-05-27 |
| 10 | Sonnet 4.5 + Sonnet 4 | 86.2% | — | Imported | 2026-05-27 |
| 11 | AutoGLM-Mobile | 80.2% | — | Imported | 2026-05-27 |
| 12 | Human | 80% | — | Imported | 2026-05-27 |
| 13 | LX-GUIAgent | 79.3% | — | Imported | 2026-05-27 |
| 14 | Gemini-2.5-Pro+UI-TARS-1.5 | 78% | — | Imported | 2026-05-27 |
| 15 | MAI-UI-235B-A22B | 76.7% | — | Imported | 2026-05-27 |
| 16 | Qwen2.5-VL-72B + Qwen2.5-VL-7B | 76.7% | — | Imported | 2026-05-27 |
| 17 | Hammer-UI-32B | 75% | — | Imported | 2026-05-27 |
| 18 | GUI-Owl-32B | 73.3% | — | Imported | 2026-05-27 |
| 19 | MAI-UI-32B | 73.3% | — | Imported | 2026-05-27 |
| 20 | MAI-UI-8B | 70.7% | — | Imported | 2026-05-27 |
| 21 | Gemini 2.5 Computer Use | 69.7% | — | Imported | 2026-05-27 |
| 22 | JT-GUIAgent-V2 | 67.2% | — | Imported | 2026-05-27 |
| 23 | GUI-Owl-7B | 66.4% | — | Imported | 2026-05-27 |
| 24 | UI-Venus-Navi-72B | 65.9% | — | Imported | 2026-05-27 |
| 25 | Qwen2.5-VL-72B | 62.9% | — | Imported | 2026-05-27 |
| 26 | Seed1.5-VL | 62.1% | — | Imported | 2026-05-27 |
| 27 | JT-GUIAgent-V1 | 60% | — | Imported | 2026-05-27 |
| 28 | V-Droid (Llama8B) | 59.5% | — | Imported | 2026-05-27 |
| 29 | Agent S2 | 54.3% | — | Imported | 2026-05-27 |
| 30 | MAI-UI-2B | 49.1% | — | Imported | 2026-05-27 |
| 31 | Venus-Navi-7B | 49.1% | — | Imported | 2026-05-27 |
| 32 | GPT-4o | 47.4% | — | Imported | 2026-05-27 |
| 33 | GPT-4o | 46.8% | — | Imported | 2026-05-27 |
| 34 | UI-TARS | 46.6% | — | Imported | 2026-05-27 |
| 35 | GPT-4o + Aria-UI | 44.8% | — | Imported | 2026-05-27 |
| 36 | GPT-4o + UGround | 44% | — | Imported | 2026-05-27 |
| 37 | ScaleTrack-7B | 44% | — | Imported | 2026-05-27 |
| 38 | GPT-4o | 42.2% | — | Imported | 2026-05-27 |
| 39 | GPT-4o | 34.5% | — | Imported | 2026-05-27 |
| 40 | GPT-4 Turbo | 30.6% | — | Imported | 2026-05-27 |
| 41 | GPT-4o, OS-Atlas-Pro 4B, Qwen2-VL-2B-Instruct | 27.6% | — | Imported | 2026-05-27 |
| 42 | Qwen-2.5-VL-7B | 27.6% | — | Imported | 2026-05-27 |
| 43 | Qwen2-VL-2B (fine-tuned) | 9% | — | Imported | 2026-05-27 |