TaxBench
TaxBench evaluates AI models on real-world tax tasks from Rivet's active tax workflows, spanning tax knowledge and judgment, tax calculations, and agentic data-retrieval question answering.
Rows: 16
Primary metric: mean_pass5
Sampled: 2026-05-27
Metadata
Metrics
Mean pass^5 (computed), Mean pass@1 (computed), Tax Knowledge pass@1, Tax Knowledge pass^5, Tax Calculations pass@1, Tax Calculations pass^5, Data Retrieval pass@1, Data Retrieval pass^5
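The page does not define these metrics, so the following is only a sketch under stated assumptions: that pass@1 is the average single-attempt success rate, that pass^5 counts a task as solved only when all five attempts succeed (the usual reading of the pass^k notation), and that the "computed" mean columns are unweighted averages of the three category scores. The function names and the sample data are hypothetical and are not taken from TaxBench.

```python
from statistics import mean


def pass_at_1(trials: list[list[bool]]) -> float:
    """Average single-attempt success rate: per-task pass fraction, averaged over tasks."""
    return mean([mean([1.0 if ok else 0.0 for ok in task]) for task in trials])


def pass_hat_k(trials: list[list[bool]], k: int = 5) -> float:
    """Fraction of tasks whose first k attempts ALL passed (assumed pass^k definition)."""
    return mean([1.0 if len(task) >= k and all(task[:k]) else 0.0 for task in trials])


def mean_over_categories(scores: dict[str, float]) -> float:
    """Unweighted mean of per-category scores (assumed meaning of the 'computed' columns)."""
    return mean(list(scores.values()))


# Hypothetical outcomes: 4 tasks, 5 attempts each (True = attempt passed).
example_trials = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [False, False, False, False, False],
    [True, True, True, True, True],
]

if __name__ == "__main__":
    print(f"pass@1 = {pass_at_1(example_trials):.2f}")
    print(f"pass^5 = {pass_hat_k(example_trials):.2f}")
```

Under these assumptions a single failed attempt zeroes out a task for pass^5 but barely moves pass@1, which would be consistent with the low pass^5 percentages in the table.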
| Rank | Subject | Mean pass^5 (computed) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.5 Pro | 29.27% | GPT-5.5 Pro (`openai-gpt-5.5-pro`) | Imported | 2026-05-27 |
| 2 | GPT 5.4 Pro | 27.03% | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Imported | 2026-05-27 |
| 3 | GPT 5.5 | 24.43% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-27 |
| 4 | Claude Opus 4.6 | 21.37% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-27 |
| 5 | Gemini 3.1 Pro | 20.10% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-27 |
| 6 | Grok 4.1 Fast Reasoning | 19.63% | Grok 4.1 Fast (`x-ai-grok-4.1-fast`) | Imported | 2026-05-27 |
| 7 | GPT 5.2 Pro | 17.77% | GPT-5.2 Pro (`openai-gpt-5.2-pro`) | Imported | 2026-05-27 |
| 8 | Gemini 3.1 Flash | 15.93% | — | Imported | 2026-05-27 |
| 9 | Grok 4.2 Reasoning | 15.20% | — | Imported | 2026-05-27 |
| 10 | Claude Opus 4.7 | 14.37% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-27 |
| 11 | Grok 4.2 | 12.23% | — | Imported | 2026-05-27 |
| 12 | Claude Sonnet 4.6 | 11.20% | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-27 |
| 13 | GPT 5.4 | 9.33% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-27 |
| 14 | Gemini 2.5 Pro | 9.00% | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-27 |
| 15 | Claude Sonnet 4.5 | 8.03% | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-27 |
| 16 | GPT 5.2 | 4.60% | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-27 |