AITZ_EM
Android-In-The-Zoo (AitZ) is a benchmark for evaluating autonomous GUI agents on smartphones. It contains 18,643 screen-action pairs with chain-of-action-thought annotations, spanning more than 70 Android apps. It is designed to connect perception (screen layouts and UI elements) with cognition (action decision-making) for completing smartphone tasks triggered by natural language.
Rows: 3
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen2.5 VL 72B Instruct | 0.83 | Qwen2.5 VL 72B Instruct (qwen-qwen2.5-vl-72b-instruct) | Self-reported | 2026-05-06 |
| 2 | Qwen2.5 VL 32B Instruct | 0.83 | — | Self-reported | 2026-05-06 |
| 3 | Qwen2.5 VL 7B Instruct | 0.82 | — | Self-reported | 2026-05-06 |