AgentQuest

AgentQuest: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.

6 rows
Primary metric: success_rate
Sampled: 2026-05-28

Metadata

Metrics

Success Rate, Steps (lower is better), Progress Rate, Repetition Rate (lower is better)
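
As a rough illustration of how metrics of this kind might be computed from an episode log, here is a minimal Python sketch. The definitions are assumptions and are not taken from this page: progress rate is treated as the fraction of task milestones reached, and repetition rate as the share of actions that exactly repeat an earlier action. The `Episode` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical log of one agent run; field names are illustrative."""
    actions: list[str]        # actions taken, in order
    milestones_reached: int   # milestones the agent hit during the run
    milestones_total: int     # milestones defined for the task
    solved: bool              # whether the task was completed


def steps(ep: Episode) -> int:
    # Number of actions taken (lower is better).
    return len(ep.actions)


def progress_rate(ep: Episode) -> float:
    # Assumed definition: fraction of milestones reached.
    if ep.milestones_total == 0:
        return 0.0
    return ep.milestones_reached / ep.milestones_total


def repetition_rate(ep: Episode) -> float:
    # Assumed definition: share of actions that exactly repeat an earlier one
    # (lower is better).
    if not ep.actions:
        return 0.0
    repeats = len(ep.actions) - len(set(ep.actions))
    return repeats / len(ep.actions)


def success_rate(episodes: list[Episode]) -> float:
    # Fraction of episodes in which the task was solved.
    if not episodes:
        return 0.0
    return sum(ep.solved for ep in episodes) / len(episodes)
```

Exact string matching is the simplest possible notion of a repeated action; a real evaluation might use a similarity function instead.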

Latest Results

Rows are imported from Table 3 of the paper's public arXiv LaTeX source. The paper states that this is a reference-agent demonstration over AgentQuest tasks rather than a thorough agent leaderboard.

Rank | Subject | Success Rate | Model Match | Provenance | Sampled
1 | Modified LangChain chat agent + GPT-4 on ALFWorld | 93% | | Imported | 2026-05-28
2 | LangChain chat agent + GPT-4 on ALFWorld | 86% | | Imported | 2026-05-28
3 | Modified LangChain chat agent + GPT-4 on Mastermind | 60% | | Imported | 2026-05-28
4 | LangChain chat agent + GPT-4 on Mastermind | 47% | | Imported | 2026-05-28
5 | LangChain chat agent + GPT-4 on Lateral Thinking Puzzles | 20% | | Imported | 2026-05-28
6 | LangChain chat agent + GPT-4 on Sudoku | 0% | | Imported | 2026-05-28
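
The ranking above follows the primary metric, with higher success rates ranked first. A minimal Python sketch of that ordering, using the six imported rows as input, might look like the following. The tuple layout and variable names are illustrative assumptions, not part of any AgentQuest tooling.

```python
# (subject, success rate in percent), taken from the imported rows,
# deliberately listed out of order.
rows = [
    ("LangChain chat agent + GPT-4 on Sudoku", 0),
    ("LangChain chat agent + GPT-4 on ALFWorld", 86),
    ("Modified LangChain chat agent + GPT-4 on Mastermind", 60),
    ("LangChain chat agent + GPT-4 on Lateral Thinking Puzzles", 20),
    ("Modified LangChain chat agent + GPT-4 on ALFWorld", 93),
    ("LangChain chat agent + GPT-4 on Mastermind", 47),
]

# Sort by success rate, highest first, since success_rate is the primary metric.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

for rank, (subject, pct) in enumerate(ranked, start=1):
    print(f"{rank} {subject} {pct}%")
```

Because each row pairs a different agent variant with a different task, the ordering compares agent-task combinations rather than agents alone.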