AutoLab
AutoLab evaluates AI agents on iterative performance-engineering tasks across model development, puzzle/challenge tasks, and system optimization.
7 rows
Primary metric: overall_score
Sampled: 2026-05-06
Metadata
Metrics
Overall Score, Model Development Score, Puzzle and Challenge Score, System Optimization Score, Tasks
| Rank | Subject | Overall Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 0.85 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 2 | Gemini 3.1 Pro | 0.71 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-06 |
| 3 | MiMo V2 Pro | 0.64 | MiMo-V2-Pro (`xiaomi-mimo-v2-pro`) | Imported | 2026-05-06 |
| 4 | GLM-5 | 0.60 | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-06 |
| 5 | GPT-5.4 | 0.56 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-06 |
| 6 | Kimi K2.5 | 0.55 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-06 |
| 7 | Qwen 3.5 Plus | 0.54 | Qwen3.5 Plus 2026-04-20 (`qwen-qwen3.5-plus-20260420`) | Imported | 2026-05-06 |