ToolQA

ToolQA: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.

Rows: 6
Primary metric: toolqa_mean_average
Sampled: 2026-05-27

Metadata

Metrics

Mean of Easy/Hard averages, Easy average, Hard average, Easy Flight, Easy Coffee, Easy Agenda, Easy Yelp, Easy DBLP, Easy GSM8K, Easy SciREX, Easy Airbnb, Hard Flight, Hard Coffee, Hard Agenda, Hard Yelp, Hard DBLP, Hard SciREX, Hard Airbnb

Latest Results

Rows are parsed from the public ToolQA benchmark README markdown performance tables. The primary score is the mean of the Easy and Hard average columns; split averages and topic scores are preserved as metrics.
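The primary-score rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual import code: the function name `primary_score` and the two-decimal rounding are assumptions made here for readability.

```python
def primary_score(easy_avg: float, hard_avg: float) -> float:
    """Return the mean of the Easy and Hard average columns.

    Hypothetical helper: rounding to two decimals is an assumption
    chosen so that results print cleanly.
    """
    return round((easy_avg + hard_avg) / 2, 2)


# Example using the Easy/Hard averages listed for Chameleon.
chameleon_mean = primary_score(10.6, 1.9)
print(chameleon_mean)
```

Because the score is an unweighted mean, a method with a high Easy average and a low Hard average can still rank above one with a more balanced profile.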

| Rank | Subject | Mean of Easy/Hard averages | Model Match | Provenance | Sampled |
|------|---------|----------------------------|-------------|------------|---------|
| 1 | ReAct (GPT-3) | easy=43.1, hard=5.1, mean=24.1 | | Imported | 2026-05-27 |
| 2 | ReAct (GPT-3.5) | easy=36.8, hard=8.2, mean=22.5 | | Imported | 2026-05-27 |
| 3 | Chameleon | easy=10.6, hard=1.9, mean=6.25 | | Imported | 2026-05-27 |
| 4 | ChatGPT | easy=5.6, hard=2.0, mean=3.8 | | Imported | 2026-05-27 |
| 5 | CoT | easy=5.1, hard=1.4, mean=3.25 | | Imported | 2026-05-27 |
| 6 | LLaMA-2 (13B) | easy=2.3, hard=1.7, mean=2.0 | | Imported | 2026-05-27 |
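The score cells above follow a `key=value` pattern, so they can be parsed and sanity-checked with a short script. The sketch below is an illustration under assumptions: the helper name `parse_score_cell`, the regular expression, and the tolerance are hypothetical choices made here, not part of the leaderboard's tooling. The values are copied from the rows listed above.

```python
import re


def parse_score_cell(cell: str) -> dict:
    """Parse a cell such as 'easy=43.1, hard=5.1, mean=24.1' into floats."""
    return {key: float(value) for key, value in re.findall(r"(\w+)=([0-9.]+)", cell)}


# Score cells copied from the results rows.
cells = {
    "ReAct (GPT-3)": "easy=43.1, hard=5.1, mean=24.1",
    "ReAct (GPT-3.5)": "easy=36.8, hard=8.2, mean=22.5",
    "Chameleon": "easy=10.6, hard=1.9, mean=6.25",
    "ChatGPT": "easy=5.6, hard=2.0, mean=3.8",
    "CoT": "easy=5.1, hard=1.4, mean=3.25",
    "LLaMA-2 (13B)": "easy=2.3, hard=1.7, mean=2.0",
}

scores = {subject: parse_score_cell(cell) for subject, cell in cells.items()}

# Check that each reported mean equals (easy + hard) / 2 within a small tolerance.
consistent = all(
    abs(s["mean"] - (s["easy"] + s["hard"]) / 2) < 1e-6 for s in scores.values()
)

# Rank subjects by the primary metric, highest first.
ranking = sorted(scores, key=lambda subject: scores[subject]["mean"], reverse=True)

print(consistent)
print(ranking)
```

A check like this catches transcription errors in the imported rows, since the reported mean should always be recoverable from the two split averages.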