AutoLab

AutoLab evaluates AI agents on iterative performance-engineering tasks across model development, puzzle/challenge tasks, and system optimization.

7 rows
Primary metric: overall_score
Sampled: 2026-05-06

Metadata

Metrics

Overall Score, Model Development Score, Puzzle and Challenge Score, System Optimization Score, Tasks

Latest Results

Rows are parsed from the static AutoLab Next.js leaderboard payload and ranked in descending order of the raw_reward overall_score. Source model display names and per-task raw values are preserved.

Rank  Subject          Overall Score  Model Match                                             Provenance  Sampled
1     Claude Opus 4.6  0.85           Claude Opus 4.6 (anthropic-claude-opus-4.6)             Imported    2026-05-06
2     Gemini 3.1 Pro   0.71           Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-06
3     MiMo V2 Pro      0.64           MiMo-V2-Pro (xiaomi-mimo-v2-pro)                        Imported    2026-05-06
4     GLM-5            0.60           GLM 5 (z-ai-glm-5)                                      Imported    2026-05-06
5     GPT-5.4          0.56           GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-06
6     Kimi K2.5        0.55           MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5)            Imported    2026-05-06
7     Qwen 3.5 Plus    0.54           Qwen3.5 Plus 2026-04-20 (qwen-qwen3.5-plus-20260420)    Imported    2026-05-06
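
The ranking rule stated under Latest Results (order rows by overall_score, highest first) can be illustrated with a minimal Python sketch. The Row class and rank_rows function are hypothetical names introduced only for this example, and the actual structure of the AutoLab payload is not shown here, so the rows are typed in by hand from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical row shape; the real payload fields are not documented here."""
    subject: str          # source model display name
    overall_score: float  # primary metric


def rank_rows(rows):
    """Return (rank, row) pairs ordered by overall_score, highest first.

    sorted() is stable, so rows with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda r: r.overall_score, reverse=True)
    return list(enumerate(ordered, start=1))


# Scores copied from the leaderboard table, deliberately out of order.
rows = [
    Row("GPT-5.4", 0.56),
    Row("Qwen 3.5 Plus", 0.54),
    Row("Claude Opus 4.6", 0.85),
    Row("GLM-5", 0.60),
    Row("Kimi K2.5", 0.55),
    Row("Gemini 3.1 Pro", 0.71),
    Row("MiMo V2 Pro", 0.64),
]

ranked = rank_rows(rows)
for rank, row in ranked:
    print(f"{rank} {row.subject} {row.overall_score:.2f}")
```

Tie handling is an assumption of this sketch: the page does not say how equal scores are ordered, and none occur in the current seven rows.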