Agentick

Universal benchmark for evaluating sequential decision-making agents (RL, LLM, and VLM) across modality, harness, task, and category settings.

Rows: 29
Primary metric: oracle_normalized_score
Sampled: 2026-05-27

Metadata

Metrics

Oracle-Normalized Score

Latest Results

Rows parsed from the Agentick full-benchmark static leaderboard. The Oracle-Normalized Score (ONS) is scaled so that 0.0 corresponds to the random baseline and 1.0 to the oracle upper bound.
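The scaling described above can be illustrated with a minimal sketch. It assumes a plain linear rescaling between the two reference points. The page does not give the exact formula or how scores are aggregated across tasks, so the function name, its parameters, and the raw values below are hypothetical.

```python
def oracle_normalized_score(raw: float, random_baseline: float, oracle: float) -> float:
    """Linearly rescale `raw` so that random_baseline maps to 0.0 and oracle to 1.0.

    Assumption: linear min-max style normalization; not confirmed by the source.
    """
    if oracle == random_baseline:
        raise ValueError("oracle and random baseline must differ")
    return (raw - random_baseline) / (oracle - random_baseline)


# Hypothetical raw returns for a single task (illustrative values only).
score = oracle_normalized_score(raw=35.0, random_baseline=10.0, oracle=60.0)
print(score)  # 0.5
```

Under this assumption an agent that does worse than the random baseline gets a negative score, and one that beats the oracle reference exceeds 1.0. Whether the leaderboard clips such values is not stated.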

| Rank | Subject | Oracle-Normalized Score | Model Match | Provenance | Sampled |
|------|---------|-------------------------|-------------|------------|---------|
| 1 | Oracle Agent | 0.895 | | Imported | 2026-05-27 |
| 2 | Qwen3.5-4B (SFT-250k) | 0.447 | | Imported | 2026-05-27 |
| 3 | Qwen3.5-4B (SFT-250k) | 0.444 | | Imported | 2026-05-27 |
| 4 | Qwen3.5-4B (SFT-120k) | 0.354 | | Imported | 2026-05-27 |
| 5 | Qwen3.5-4B (SFT-120k) | 0.349 | | Imported | 2026-05-27 |
| 6 | GPT-5 mini | 0.309 | | Imported | 2026-05-27 |
| 7 | PPO Dense (2M) | 0.287 | | Imported | 2026-05-27 |
| 8 | Qwen3.5-4B | 0.228 | | Imported | 2026-05-27 |
| 9 | PPO Dense (500k) | 0.226 | | Imported | 2026-05-27 |
| 10 | Gemini 2.5 Flash Lite | 0.187 | | Imported | 2026-05-27 |
| 11 | Qwen3.5-4B | 0.181 | | Imported | 2026-05-27 |
| 12 | Qwen3.5-2B | 0.133 | | Imported | 2026-05-27 |
| 13 | Qwen3.5-2B | 0.122 | | Imported | 2026-05-27 |
| 14 | Qwen3.5-0.8B | 0.094 | | Imported | 2026-05-27 |
| 15 | Qwen3-4B | 0.085 | | Imported | 2026-05-27 |
| 16 | Random Agent | 0.082 | | Imported | 2026-05-27 |
| 17 | PPO Sparse (500k) | 0.074 | | Imported | 2026-05-27 |
| 18 | Gemini 2.5 Flash Lite | 0.064 | | Imported | 2026-05-27 |
| 19 | Qwen3.5-2B | 0.062 | | Imported | 2026-05-27 |
| 20 | Qwen3.5-0.8B | 0.061 | | Imported | 2026-05-27 |
| 21 | Gemini 2.5 Flash Lite | 0.053 | | Imported | 2026-05-27 |
| 22 | Qwen3-4B | 0.050 | | Imported | 2026-05-27 |
| 23 | Qwen3.5-2B | 0.031 | | Imported | 2026-05-27 |
| 24 | Qwen3.5-4B | 0.023 | | Imported | 2026-05-27 |
| 25 | Qwen3.5-4B | 0.020 | | Imported | 2026-05-27 |
| 26 | Qwen3.5-0.8B | 0.020 | | Imported | 2026-05-27 |
| 27 | Qwen3-4B | 0.020 | | Imported | 2026-05-27 |
| 28 | Qwen3-4B | 0.019 | | Imported | 2026-05-27 |
| 29 | Qwen3.5-0.8B | 0.016 | | Imported | 2026-05-27 |