AgentBoard

AgentBoard: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.

13rows
progress_rateprimary metric
2026-05-27sampled

Metadata

Metrics

Progress Rate, Success Rate, Grounding Accuracy

Latest Results

Rows are parsed from the public JSON payload loaded by the AgentBoard static leaderboard. Score uses the Avg progress-rate field.

Rank Subject Progress Rate Model Match Provenance Sampled
1 GPT-4 70% Imported 2026-05-27
2 Claude2 48.9% Imported 2026-05-27
3 GPT-3.5-Turbo 41.4% Imported 2026-05-27
4 DeepSeek-67b 38.5% Imported 2026-05-27
5 Text-Davinci-003 37.2% Imported 2026-05-27
6 GPT-3.5-Turbo-16k 34.2% Imported 2026-05-27
7 Lemur-70b 30.8% Imported 2026-05-27
8 CodeLlama-34b 30% Imported 2026-05-27
9 CodeLlama-13b 25.8% Imported 2026-05-27
10 Mistral-7b 24.6% Imported 2026-05-27
11 Llama2-70b 23.8% Imported 2026-05-27
12 Vicuna-13b-16k 23.1% Imported 2026-05-27
13 Llama2-13b 18.9% Imported 2026-05-27