AgentBoard
AgentBoard: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
13rows
progress_rateprimary metric
2026-05-27sampled
Metadata
Metrics
Progress Rate, Success Rate, Grounding Accuracy
| Rank | Subject | Progress Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 70% | — | Imported | 2026-05-27 |
| 2 | Claude2 | 48.9% | — | Imported | 2026-05-27 |
| 3 | GPT-3.5-Turbo | 41.4% | — | Imported | 2026-05-27 |
| 4 | DeepSeek-67b | 38.5% | — | Imported | 2026-05-27 |
| 5 | Text-Davinci-003 | 37.2% | — | Imported | 2026-05-27 |
| 6 | GPT-3.5-Turbo-16k | 34.2% | — | Imported | 2026-05-27 |
| 7 | Lemur-70b | 30.8% | — | Imported | 2026-05-27 |
| 8 | CodeLlama-34b | 30% | — | Imported | 2026-05-27 |
| 9 | CodeLlama-13b | 25.8% | — | Imported | 2026-05-27 |
| 10 | Mistral-7b | 24.6% | — | Imported | 2026-05-27 |
| 11 | Llama2-70b | 23.8% | — | Imported | 2026-05-27 |
| 12 | Vicuna-13b-16k | 23.1% | — | Imported | 2026-05-27 |
| 13 | Llama2-13b | 18.9% | — | Imported | 2026-05-27 |
No matching rows.