MLAgentBench

MLAgentBench evaluates autonomous agents on multi-step machine learning experimentation tasks that require planning, state tracking, tool use, and error recovery.

Rows: 8
Primary metric: success_rate
Sampled: 2026-05-27

Metadata

Metrics

Average success rate, Average improvement over baseline

Latest Results

Rows are transcribed from public MLAgentBench ICML 2024 paper Tables 3 and 4. Primary score is the average success rate row from Table 3; average improvement is the corresponding Average row from Table 4.

Rank  Subject         Average success rate  Model Match                       Provenance  Sampled
1     Claude v3 Opus  37.5%                 -                                 Imported    2026-05-27
2     Claude v2.1     26.0%                 -                                 Imported    2026-05-27
3     GPT-4-turbo     26.0%                 GPT-4 Turbo (openai-gpt-4-turbo)  Imported    2026-05-27
4     GPT-4           19.2%                 GPT-4 (openai-gpt-4)              Imported    2026-05-27
5     Gemini Pro      18.3%                 -                                 Imported    2026-05-27
6     Claude v1.0     16.3%                 -                                 Imported    2026-05-27
7     Baseline        10.4%                 -                                 Imported    2026-05-27
8     Mixtral         3.8%                  -                                 Imported    2026-05-27
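
For readers who want to work with these rows programmatically, the ranking can be reproduced with a minimal Python sketch. The `Row` class and the `rank_rows` and `below_baseline` helpers are hypothetical names introduced here for illustration; they are not part of MLAgentBench or of this page's tooling. The sketch assumes ties (Claude v2.1 and GPT-4-turbo, both 26.0%) keep their listed order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    subject: str
    success_rate: float  # average success rate, in percent


# Values copied from the table above.
ROWS = [
    Row("Claude v3 Opus", 37.5),
    Row("Claude v2.1", 26.0),
    Row("GPT-4-turbo", 26.0),
    Row("GPT-4", 19.2),
    Row("Gemini Pro", 18.3),
    Row("Claude v1.0", 16.3),
    Row("Baseline", 10.4),
    Row("Mixtral", 3.8),
]


def rank_rows(rows):
    """Sort rows by success rate, highest first.

    Python's sort is stable, so tied rows keep their input order.
    """
    return sorted(rows, key=lambda r: r.success_rate, reverse=True)


def below_baseline(rows, baseline_name="Baseline"):
    """Return subjects whose success rate is lower than the baseline row's."""
    baseline = next(r for r in rows if r.subject == baseline_name)
    return [r.subject for r in rows if r.success_rate < baseline.success_rate]


if __name__ == "__main__":
    for rank, row in enumerate(rank_rows(ROWS), start=1):
        print(f"{rank}  {row.subject:<15} {row.success_rate:.1f}%")
    print("Below baseline:", below_baseline(ROWS))
```

Note that this comparison uses only the average success rate column. It does not reproduce the separate "Average improvement over baseline" metric, which the page takes from Table 4 of the paper.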