FinToolBench
Financial tool-use benchmark with real tools and APIs, measuring tool invocation, execution success, compliance, and soft-scored task quality.
Rows: 4
Primary metric: soft_score
Sampled: 2026-05-27
Metadata
Metrics
Soft Score, Tool Invocation Rate, Tool Execution Success Rate, Conditional Execution Rate, Conditional Soft Score, Timeliness Mismatch Rate (lower is better), Intent Mismatch Rate (lower is better), Domain Mismatch Rate (lower is better)
| Rank | Subject | Soft Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3-8B | 0.4234 | Qwen3 8B (`qwen-qwen3-8b`) | Imported | 2026-05-27 |
| 2 | Doubao-Seed-1.6 | 0.3958 | — | Imported | 2026-05-27 |
| 3 | GLM-4.7-Flash | 0.2769 | GLM 4.7 Flash (`z-ai-glm-4.7-flash`) | Imported | 2026-05-27 |
| 4 | GPT-4o | 0.2302 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
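
The ranking in the table can be reproduced with a short sketch like the one below. It assumes that Soft Score is ranked highest-first and that the three metrics marked "lower is better" are ranked lowest-first. The `rank` helper and the snake_case metric keys other than `soft_score` are hypothetical names chosen for illustration, not identifiers taken from FinToolBench.

```python
# Rows copied from the leaderboard table: (subject, soft_score).
ROWS = [
    ("GPT-4o", 0.2302),
    ("Qwen3-8B", 0.4234),
    ("GLM-4.7-Flash", 0.2769),
    ("Doubao-Seed-1.6", 0.3958),
]

# Assumption: these hypothetical keys stand for the metrics the page
# labels "lower is better"; every other metric is treated as higher-is-better.
LOWER_IS_BETTER = {
    "timeliness_mismatch_rate",
    "intent_mismatch_rate",
    "domain_mismatch_rate",
}


def rank(rows, metric="soft_score"):
    """Return (rank, subject, score) tuples ordered best-first for `metric`."""
    descending = metric not in LOWER_IS_BETTER
    ordered = sorted(rows, key=lambda row: row[1], reverse=descending)
    return [(i + 1, subject, score) for i, (subject, score) in enumerate(ordered)]


if __name__ == "__main__":
    for position, subject, score in rank(ROWS):
        print(f"{position}. {subject}: {score:.4f}")
```

Keeping the sort direction in one set means a mismatch-rate column could be ranked with the same helper, provided the scores passed in belong to that metric.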