AppWorld
Benchmark for interactive app-based task completion across simulated digital services, evaluating agents on tool use and stateful workflows.
Rows: 15
Primary metric: percent_successful
Sampled: 2026-05-06
Metadata
Metrics: Successful Sessions, Benchmark Score, Finished Successful, Avg. Agent Cost (lower is better), Avg. Steps (lower is better)
| Rank | Subject | Successful Sessions | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | SmolAgents Code / openai/aws/claude-opus-4-5 | 0.70 | — | Imported | 2026-05-06 |
| 2 | OpenAI Solo / openai/aws/claude-opus-4-5 | 0.68 | — | Imported | 2026-05-06 |
| 3 | Claude Code CLI / openai/aws/claude-opus-4-5 | 0.66 | — | Imported | 2026-05-06 |
| 4 | LiteLLM Tool Calling with Shortlisting / openai/aws/claude-opus-4-5 | 0.64 | — | Imported | 2026-05-06 |
| 5 | LiteLLM Tool Calling / openai/aws/claude-opus-4-5 | 0.61 | — | Imported | 2026-05-06 |
| 6 | OpenAI Solo / openai/gcp/gemini-3-pro-preview | 0.57 | — | Imported | 2026-05-06 |
| 7 | LiteLLM Tool Calling with Shortlisting / openai/gcp/gemini-3-pro-preview | 0.55 | — | Imported | 2026-05-06 |
| 8 | LiteLLM Tool Calling / openai/gcp/gemini-3-pro-preview | 0.50 | — | Imported | 2026-05-06 |
| 9 | Claude Code CLI / openai/gcp/gemini-3-pro-preview | 0.36 | — | Imported | 2026-05-06 |
| 10 | LiteLLM Tool Calling with Shortlisting / openai/Azure/gpt-5.2-2025-12-11 | 0.22 | — | Imported | 2026-05-06 |
| 11 | SmolAgents Code / openai/gcp/gemini-3-pro-preview | 0.13 | — | Imported | 2026-05-06 |
| 12 | SmolAgents Code / openai/Azure/gpt-5.2-2025-12-11 | 0.07 | — | Imported | 2026-05-06 |
| 13 | Claude Code CLI / openai/Azure/gpt-5.2-2025-12-11 | 0.00 | — | Imported | 2026-05-06 |
| 14 | OpenAI Solo / openai/Azure/gpt-5.2-2025-12-11 | 0.00 | — | Imported | 2026-05-06 |
| 15 | LiteLLM Tool Calling / openai/Azure/gpt-5.2-2025-12-11 | 0.00 | — | Imported | 2026-05-06 |
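
Because each Subject combines two things, it can help to average the Successful Sessions column along each axis separately. The sketch below does that for the 15 rows above. It assumes that the text before " / " in a Subject is the agent harness and the text after it is the model; the page does not state this, and the names `ROWS`, `mean_by`, `by_model`, and `by_harness` are illustrative only. These are plain unweighted means of the listed values, not an official AppWorld aggregate.

```python
from collections import defaultdict
from statistics import mean

# (harness, model, successful_sessions) copied from the table above.
# Splitting Subject into "harness" and "model" is an assumption.
ROWS = [
    ("SmolAgents Code", "openai/aws/claude-opus-4-5", 0.70),
    ("OpenAI Solo", "openai/aws/claude-opus-4-5", 0.68),
    ("Claude Code CLI", "openai/aws/claude-opus-4-5", 0.66),
    ("LiteLLM Tool Calling with Shortlisting", "openai/aws/claude-opus-4-5", 0.64),
    ("LiteLLM Tool Calling", "openai/aws/claude-opus-4-5", 0.61),
    ("OpenAI Solo", "openai/gcp/gemini-3-pro-preview", 0.57),
    ("LiteLLM Tool Calling with Shortlisting", "openai/gcp/gemini-3-pro-preview", 0.55),
    ("LiteLLM Tool Calling", "openai/gcp/gemini-3-pro-preview", 0.50),
    ("Claude Code CLI", "openai/gcp/gemini-3-pro-preview", 0.36),
    ("LiteLLM Tool Calling with Shortlisting", "openai/Azure/gpt-5.2-2025-12-11", 0.22),
    ("SmolAgents Code", "openai/gcp/gemini-3-pro-preview", 0.13),
    ("SmolAgents Code", "openai/Azure/gpt-5.2-2025-12-11", 0.07),
    ("Claude Code CLI", "openai/Azure/gpt-5.2-2025-12-11", 0.0),
    ("OpenAI Solo", "openai/Azure/gpt-5.2-2025-12-11", 0.0),
    ("LiteLLM Tool Calling", "openai/Azure/gpt-5.2-2025-12-11", 0.0),
]


def mean_by(rows, key_index):
    """Unweighted mean of the score column, grouped by the given tuple field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: round(mean(scores), 3) for key, scores in groups.items()}


by_model = mean_by(ROWS, 1)    # average over harnesses for each model
by_harness = mean_by(ROWS, 0)  # average over models for each harness

if __name__ == "__main__":
    for name, score in sorted(by_model.items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {name}")
    for name, score in sorted(by_harness.items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {name}")
```

With three models per harness and five harnesses per model, each average rests on very few rows, so the grouped means are better read as a rough summary of the table than as a ranking in their own right.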