ToolQA
ToolQA: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
6 rows
Primary metric: toolqa_mean_average
Sampled: 2026-05-27
Metrics
Mean of Easy/Hard averages, Easy average, Hard average, Easy Flight, Easy Coffee, Easy Agenda, Easy Yelp, Easy DBLP, Easy GSM8K, Easy SciREX, Easy Airbnb, Hard Flight, Hard Coffee, Hard Agenda, Hard Yelp, Hard DBLP, Hard SciREX, Hard Airbnb
| Rank | Subject | Easy average | Hard average | Mean of Easy/Hard averages | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|---|---|
| 1 | ReAct (GPT-3) | 43.1 | 5.1 | 24.1 | — | Imported | 2026-05-27 |
| 2 | ReAct (GPT-3.5) | 36.8 | 8.2 | 22.5 | — | Imported | 2026-05-27 |
| 3 | Chameleon | 10.6 | 1.9 | 6.25 | — | Imported | 2026-05-27 |
| 4 | ChatGPT | 5.6 | 2.0 | 3.8 | — | Imported | 2026-05-27 |
| 5 | CoT | 5.1 | 1.4 | 3.25 | — | Imported | 2026-05-27 |
| 6 | LLaMA-2 (13B) | 2.3 | 1.7 | 2.0 | — | Imported | 2026-05-27 |
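
The page does not spell out how the primary metric is calculated. Every row above is nevertheless consistent with an unweighted mean of the Easy and Hard averages, for example (43.1 + 5.1) / 2 = 24.1. The sketch below recomputes the metric and the ranking under that assumption. The names `scores` and `mean_of_averages` are hypothetical and are not part of any ToolQA tooling.

```python
# Easy and Hard averages copied from the leaderboard rows (percentages).
scores = {
    "ReAct (GPT-3)": (43.1, 5.1),
    "ReAct (GPT-3.5)": (36.8, 8.2),
    "Chameleon": (10.6, 1.9),
    "ChatGPT": (5.6, 2.0),
    "CoT": (5.1, 1.4),
    "LLaMA-2 (13B)": (2.3, 1.7),
}


def mean_of_averages(easy: float, hard: float) -> float:
    """Assumed primary metric: unweighted mean of the Easy and Hard averages."""
    return (easy + hard) / 2


# Sort subjects by the assumed primary metric, highest first.
ranked = sorted(
    scores.items(),
    key=lambda item: mean_of_averages(*item[1]),
    reverse=True,
)

for rank, (subject, (easy, hard)) in enumerate(ranked, start=1):
    print(f"{rank}. {subject}: {mean_of_averages(easy, hard):.2f}")
```

Because the two averages are weighted equally, a large Easy score can outweigh a better Hard score: ReAct (GPT-3) ranks above ReAct (GPT-3.5) even though the latter has the higher Hard average.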