AppWorld

Benchmark for interactive app-based task completion across simulated digital services, evaluating agents on tool use and stateful workflows.

Rows: 15
Primary metric: percent_successful
Sampled: 2026-05-06

Metadata

Metrics

Successful Sessions, Benchmark Score, Finished Successful, Avg. Agent Cost (lower is better), Avg. Steps (lower is better)

Latest Results

Rows are ranked by percent_successful. Agent and model display names are preserved from the source dataset.

Rank | Subject | Successful Sessions | Model Match | Provenance | Sampled
1 | SmolAgents Code / openai/aws/claude-opus-4-5 | 0.70 | | Imported | 2026-05-06
2 | OpenAI Solo / openai/aws/claude-opus-4-5 | 0.68 | | Imported | 2026-05-06
3 | Claude Code CLI / openai/aws/claude-opus-4-5 | 0.66 | | Imported | 2026-05-06
4 | LiteLLM Tool Calling with Shortlisting / openai/aws/claude-opus-4-5 | 0.64 | | Imported | 2026-05-06
5 | LiteLLM Tool Calling / openai/aws/claude-opus-4-5 | 0.61 | | Imported | 2026-05-06
6 | OpenAI Solo / openai/gcp/gemini-3-pro-preview | 0.57 | | Imported | 2026-05-06
7 | LiteLLM Tool Calling with Shortlisting / openai/gcp/gemini-3-pro-preview | 0.55 | | Imported | 2026-05-06
8 | LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview | 0.50 | | Imported | 2026-05-06
9 | Claude Code CLI / openai/gcp/gemini-3-pro-preview | 0.36 | | Imported | 2026-05-06
10 | LiteLLM Tool Calling with Shortlisting / openai/Azure/gpt-5.2-2025-12-11 | 0.22 | | Imported | 2026-05-06
11 | SmolAgents Code / openai/gcp/gemini-3-pro-preview | 0.13 | | Imported | 2026-05-06
12 | SmolAgents Code / openai/Azure/gpt-5.2-2025-12-11 | 0.07 | | Imported | 2026-05-06
13 | Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11 | 0 | | Imported | 2026-05-06
14 | OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11 | 0 | | Imported | 2026-05-06
15 | LiteLLM Tool Calling / openai/Azure/gpt-5.2-2025-12-11 | 0 | | Imported | 2026-05-06
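
The ranking rule stated above (rows ordered by percent_successful, highest first) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code that produced this table. The `Row` type and the `split_subject` and `rank` helpers are hypothetical names, and the sketch assumes each Subject is an agent name and a model identifier joined by " / ". How the source dataset breaks ties (for example, the three rows at 0) is not stated, so ties here simply keep their input order.

```python
from typing import NamedTuple


class Row(NamedTuple):
    # Hypothetical record type; field names are chosen for illustration.
    agent: str
    model: str
    percent_successful: float


def split_subject(subject: str) -> tuple[str, str]:
    """Split 'Agent / model-id' on the first ' / ' separator (assumed format)."""
    agent, model = subject.split(" / ", 1)
    return agent, model


def rank(rows: list[Row]) -> list[tuple[int, Row]]:
    """Sort by percent_successful, highest first, and number positions from 1.

    sorted() is stable, so tied rows keep their input order; the real
    tie-breaking rule is unknown.
    """
    ordered = sorted(rows, key=lambda r: r.percent_successful, reverse=True)
    return list(enumerate(ordered, start=1))


# Three rows copied from the table above, deliberately out of order.
subjects = [
    ("LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview", 0.50),
    ("SmolAgents Code / openai/aws/claude-opus-4-5", 0.70),
    ("Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11", 0.0),
]
rows = [Row(*split_subject(s), score) for s, score in subjects]
ranked = rank(rows)

for position, row in ranked:
    print(position, row.agent, row.model, row.percent_successful)
```

Splitting on the first " / " only keeps the slashes inside the model identifier (such as openai/aws/claude-opus-4-5) intact, which makes it straightforward to group results by agent or by model afterwards.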