MLAgentBench
MLAgentBench: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
Rows: 8
Primary metric: success_rate
Sampled: 2026-05-27
Metadata
Metrics: Average success rate, Average improvement over baseline
| Rank | Subject | Average success rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude v3 Opus | 37.5% | — | Imported | 2026-05-27 |
| 2 | Claude v2.1 | 26.0% | — | Imported | 2026-05-27 |
| 3 | GPT-4-turbo | 26.0% | GPT-4 Turbo (`openai-gpt-4-turbo`) | Imported | 2026-05-27 |
| 4 | GPT-4 | 19.2% | GPT-4 (`openai-gpt-4`) | Imported | 2026-05-27 |
| 5 | Gemini Pro | 18.3% | — | Imported | 2026-05-27 |
| 6 | Claude v1.0 | 16.3% | — | Imported | 2026-05-27 |
| 7 | Baseline | 10.4% | — | Imported | 2026-05-27 |
| 8 | Mixtral | 3.8% | — | Imported | 2026-05-27 |
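Two rows in the table (Claude v2.1 and GPT-4-turbo) report the same 26.0% success rate but are listed at ranks 2 and 3. The sketch below shows one way to re-derive ranks with ties made explicit, along with each subject's gap to the Baseline row in percentage points. The function names `competition_ranks` and `gap_vs_baseline` are hypothetical helpers introduced for illustration. The gap computed here is a plain difference of success rates and is not the benchmark's "Average improvement over baseline" metric, which this page lists but does not report.

```python
# Success rates (percent) copied from the leaderboard table above.
success_rate = {
    "Claude v3 Opus": 37.5,
    "Claude v2.1": 26.0,
    "GPT-4-turbo": 26.0,
    "GPT-4": 19.2,
    "Gemini Pro": 18.3,
    "Claude v1.0": 16.3,
    "Baseline": 10.4,
    "Mixtral": 3.8,
}


def competition_ranks(scores):
    """Standard competition ranking: tied scores share a rank and the
    following rank is skipped (1, 2, 2, 4, ...)."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        # Reuse the previous rank when the score is identical.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


def gap_vs_baseline(scores, baseline="Baseline"):
    """Difference in success rate relative to the baseline row,
    in percentage points (an illustrative quantity, not an official metric)."""
    return {
        name: round(score - scores[baseline], 1)
        for name, score in scores.items()
    }


ranks = competition_ranks(success_rate)
gaps = gap_vs_baseline(success_rate)

for name in sorted(success_rate, key=lambda n: ranks[n]):
    print(f"{ranks[name]:>2}  {name:<15} {success_rate[name]:5.1f}%  {gaps[name]:+.1f} pp")
```

Under this ranking, the two tied subjects share rank 2 and GPT-4 keeps rank 4. Mixtral is the only subject that falls below the Baseline row.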