Android Control High_EM

An Android device control benchmark that uses the High_EM (high exact match) evaluation metric to assess agent performance on mobile interface tasks.

3 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject                  Score  Model Match                                             Provenance     Sampled
1     Qwen2.5 VL 32B Instruct  0.70   -                                                       Self-reported  2026-05-06
2     Qwen2.5 VL 72B Instruct  0.67   Qwen2.5 VL 72B Instruct (qwen-qwen2.5-vl-72b-instruct)  Self-reported  2026-05-06
3     Qwen2.5 VL 7B Instruct   0.60   -                                                       Self-reported  2026-05-06