Android Bench

Android Bench is an Android development benchmark that measures how well LLM agents resolve real Android tasks, reporting success rate, latency, token use, and cost.

Rows: 28
Primary metric: score_pct
Sampled: 2026-05-27

Metadata

Metrics

Score, Average Latency (lower is better), Average Total Tokens (lower is better), Average Cost (lower is better)

Latest Results

Rows parsed from the Android Bench public static leaderboard. Score is the average percentage of 100 Android development tasks successfully resolved across 10 runs.
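
As an illustration of the score definition above, here is a minimal sketch of how a per-model score could be computed from per-run results. The function name `score_pct` mirrors the primary metric name, but the input format (a list of resolved-task counts, one per run) and the sample counts are assumptions made for illustration; they are not leaderboard data, and the benchmark's actual aggregation code is not shown on this page.

```python
from statistics import mean

TASKS_PER_RUN = 100  # task count stated in the description above


def score_pct(resolved_per_run, tasks_per_run=TASKS_PER_RUN):
    """Average percentage of tasks resolved across runs.

    resolved_per_run: hypothetical list with one resolved-task count per run.
    """
    if not resolved_per_run:
        raise ValueError("need at least one run")
    for resolved in resolved_per_run:
        if not 0 <= resolved <= tasks_per_run:
            raise ValueError(f"resolved count out of range: {resolved}")
    # Convert each run to a percentage, then average over the runs.
    return mean(100.0 * resolved / tasks_per_run for resolved in resolved_per_run)


# Illustrative counts for 10 runs (made up, not taken from the leaderboard).
example_runs = [60, 62, 58, 61, 59, 60, 63, 57, 60, 60]
print(round(score_pct(example_runs), 1))  # prints 60.0
```

With 100 tasks per run, each run's percentage equals its resolved count, so the score reduces to the mean number of tasks resolved per run.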

Rank Subject Score Model Match Provenance Sampled
1 GPT 5.5 74.0 Imported 2026-05-27
2 GPT 5.4 72.4 Imported 2026-05-27
3 Gemini 3.1 Pro Preview 72.4 Imported 2026-05-27
4 Claude Opus 4.7 68.7 Imported 2026-05-27
5 GPT 5.3 Codex 67.7 Imported 2026-05-27
6 Claude Opus 4.6 66.6 Imported 2026-05-27
7 GPT 5.2 Codex 62.5 Imported 2026-05-27
8 Claude Opus 4.5 61.9 Imported 2026-05-27
9 Gemini 3 Pro Preview 60.4 Imported 2026-05-27
10 GLM 5.1 59.7 Imported 2026-05-27
11 Claude Sonnet 4.6 58.4 Imported 2026-05-27
12 Kimi K2.6 58.6 Imported 2026-05-27
13 DeepSeek V4 Pro 55.4 Imported 2026-05-27
14 Claude Sonnet 4.5 54.2 Imported 2026-05-27
15 DeepSeek V4 Flash 52.7 Imported 2026-05-27
16 MiMo 2.5 Pro 52.0 Imported 2026-05-27
17 Qwen 3.6 Max Preview 51.4 Imported 2026-05-27
18 Gemini 3 Flash Preview 42.0 Imported 2026-05-27
19 MiniMax M2.7 37.2 Imported 2026-05-27
20 Qwen 3.6 27B 37.4 Imported 2026-05-27
21 Gemma 4 31B IT 33.2 Imported 2026-05-27
22 Qwen 3.6 35B A3B 31.7 Imported 2026-05-27
23 Gemini 2.5 Pro 29.1 Imported 2026-05-27
24 Gemma 4 26B A4B IT 25.1 Imported 2026-05-27
25 GPT OSS 120B 18.9 Imported 2026-05-27
26 Gemini 2.5 Flash 15.9 Imported 2026-05-27
27 Qwen 3.5 9B 15.5 Imported 2026-05-27
28 GPT OSS 20B 2.4 Imported 2026-05-27
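
The rows above are space-separated, and subject names themselves contain spaces and numbers, so splitting on whitespace alone is not enough. The following is a rough sketch of one way to parse a row, assuming the Model Match column is empty (as it is in every row shown) and that each line ends with a score, a one-word provenance value, and an ISO date. The `Row` type and `parse_row` helper are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    subject: str
    score: float
    provenance: str
    sampled: str


# rank, subject (free text), score, provenance word, ISO date.
# Anchoring at the end of the line lets the lazy subject group absorb
# version numbers such as "3.6 27B" that belong to the model name.
ROW_RE = re.compile(
    r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)\s+(\w+)\s+(\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Optional[Row]:
    """Parse one leaderboard line; return None if it does not match."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    rank, subject, score, provenance, sampled = match.groups()
    return Row(int(rank), subject, float(score), provenance, sampled)


row = parse_row("20 Qwen 3.6 27B 37.4 Imported 2026-05-27")
print(row.subject, row.score)  # prints: Qwen 3.6 27B 37.4
```

Lines that do not fit the pattern, such as the header row, return `None` rather than raising, which makes it straightforward to filter them out when reading the whole table.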