Android Bench
Android development benchmark measuring how well LLM agents resolve real Android tasks, with success rate, latency, token use, and cost.
Rows: 28
Primary metric: score_pct
Sampled: 2026-05-27
Metadata
Metrics
Score, Average Latency (lower is better), Average Total Tokens (lower is better), Average Cost (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.5 | 74.0 | — | Imported | 2026-05-27 |
| 2 | GPT 5.4 | 72.4 | — | Imported | 2026-05-27 |
| 3 | Gemini 3.1 Pro Preview | 72.4 | — | Imported | 2026-05-27 |
| 4 | Claude Opus 4.7 | 68.7 | — | Imported | 2026-05-27 |
| 5 | GPT 5.3 Codex | 67.7 | — | Imported | 2026-05-27 |
| 6 | Claude Opus 4.6 | 66.6 | — | Imported | 2026-05-27 |
| 7 | GPT 5.2 Codex | 62.5 | — | Imported | 2026-05-27 |
| 8 | Claude Opus 4.5 | 61.9 | — | Imported | 2026-05-27 |
| 9 | Gemini 3 Pro Preview | 60.4 | — | Imported | 2026-05-27 |
| 10 | GLM 5.1 | 59.7 | — | Imported | 2026-05-27 |
| 11 | Claude Sonnet 4.6 | 58.4 | — | Imported | 2026-05-27 |
| 12 | Kimi K2.6 | 58.6 | — | Imported | 2026-05-27 |
| 13 | DeepSeek V4 Pro | 55.4 | — | Imported | 2026-05-27 |
| 14 | Claude Sonnet 4.5 | 54.2 | — | Imported | 2026-05-27 |
| 15 | DeepSeek V4 Flash | 52.7 | — | Imported | 2026-05-27 |
| 16 | MiMo 2.5 Pro | 52.0 | — | Imported | 2026-05-27 |
| 17 | Qwen 3.6 Max Preview | 51.4 | — | Imported | 2026-05-27 |
| 18 | Gemini 3 Flash Preview | 42.0 | — | Imported | 2026-05-27 |
| 19 | MiniMax M2.7 | 37.2 | — | Imported | 2026-05-27 |
| 20 | Qwen 3.6 27B | 37.4 | — | Imported | 2026-05-27 |
| 21 | Gemma 4 31B IT | 33.2 | — | Imported | 2026-05-27 |
| 22 | Qwen 3.6 35B A3B | 31.7 | — | Imported | 2026-05-27 |
| 23 | Gemini 2.5 Pro | 29.1 | — | Imported | 2026-05-27 |
| 24 | Gemma 4 26B A4B IT | 25.1 | — | Imported | 2026-05-27 |
| 25 | GPT OSS 120B | 18.9 | — | Imported | 2026-05-27 |
| 26 | Gemini 2.5 Flash | 15.9 | — | Imported | 2026-05-27 |
| 27 | Qwen 3.5 9B | 15.5 | — | Imported | 2026-05-27 |
| 28 | GPT OSS 20B | 2.4 | — | Imported | 2026-05-27 |
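The Rank column mostly follows descending Score, but not everywhere. A minimal Python sketch for checking this is shown below, using a few rows copied from the table. The helper name `rank_inversions` and the tuple layout are illustrative assumptions, not part of the benchmark's tooling. The page does not say how ranks are assigned, so an inversion may reflect rounding of the displayed score or a tie-breaker on another metric rather than an error.

```python
# Rows copied from the table above as (rank, subject, score) tuples.
# Only a subset is listed to keep the sketch short.
ROWS = [
    (10, "GLM 5.1", 59.7),
    (11, "Claude Sonnet 4.6", 58.4),
    (12, "Kimi K2.6", 58.6),
    (13, "DeepSeek V4 Pro", 55.4),
    (19, "MiniMax M2.7", 37.2),
    (20, "Qwen 3.6 27B", 37.4),
    (21, "Gemma 4 31B IT", 33.2),
]


def rank_inversions(rows):
    """Return (better_ranked, worse_ranked) subject pairs where the
    worse-ranked row shows a strictly higher score.

    Hypothetical helper: assumes rank is meant to follow descending score.
    """
    ordered = sorted(rows, key=lambda row: row[0])  # sort by rank, ascending
    return [
        (a[1], b[1])
        for a, b in zip(ordered, ordered[1:])  # consecutive listed rows
        if b[2] > a[2]
    ]


if __name__ == "__main__":
    for better, worse in rank_inversions(ROWS):
        print(f"{worse} outscores {better} but is ranked below it")
```

On the rows listed, this flags two pairs: Claude Sonnet 4.6 / Kimi K2.6 and MiniMax M2.7 / Qwen 3.6 27B.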