FinToolBench

Financial tool-use benchmark with real tools and APIs, measuring tool invocation, execution success, compliance, and soft-scored task quality.

4 rows
soft_score primary metric
2026-05-27 sampled

Metadata

Metrics

Soft Score, Tool Invocation Rate, Tool Execution Success Rate, Conditional Execution Rate, Conditional Soft Score, Timeliness Mismatch Rate (lower is better), Intent Mismatch Rate (lower is better), Domain Mismatch Rate (lower is better)

Latest Results

Rows are transcribed from Table 3 of the public FinToolBench arXiv paper. The primary score is Soft Score; the tool invocation, execution, conditional execution, conditional soft score, and compliance mismatch rates are preserved.

Rank  Subject          Soft Score  Model Match                             Provenance  Sampled
1     Qwen3-8B         0.4234      Qwen3 8B (qwen-qwen3-8b)                Imported    2026-05-27
2     Doubao-Seed-1.6  0.3958      -                                       Imported    2026-05-27
3     GLM-4.7-Flash    0.2769      GLM GLM 4.7 Flash (z-ai-glm-4.7-flash)  Imported    2026-05-27
4     GPT-4o           0.2302      GPT-4o (openai-gpt-4o)                  Imported    2026-05-27
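
The ranking above follows the primary metric: subjects are ordered by Soft Score, highest first. A minimal sketch of that ordering, using only the four Soft Score values listed here, is shown below. The function name rank_by_soft_score and the tuple layout are illustrative assumptions, not part of FinToolBench or of the leaderboard's tooling.

```python
# Soft Score values copied from the results listed above.
# The (subject, soft_score) tuple layout is an assumption made for this sketch.
rows = [
    ("GPT-4o", 0.2302),
    ("Qwen3-8B", 0.4234),
    ("GLM-4.7-Flash", 0.2769),
    ("Doubao-Seed-1.6", 0.3958),
]


def rank_by_soft_score(rows):
    """Return (rank, subject, soft_score) tuples, highest Soft Score first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_by_soft_score(rows)
for rank, subject, score in ranked:
    print(f"{rank}  {subject:<16} {score:.4f}")
```

The mismatch rates listed under Metrics are marked "lower is better", so sorting by one of them would need reverse=False instead.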