TableBench
TableBench measures structured-data reasoning across tables, spreadsheets, charts, databases, and data analysis tasks.
Rows: 36
Primary metric: Overall
Sampled: 2026-05-27
Metadata
Metrics: Overall, Fact checking, Numerical reasoning, Data analysis, Visualization
| Rank | Subject | Overall | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Human Performance | 85.91% | — | Imported | 2026-05-27 |
| 2 | ButtonAgent | 64.14% | — | Imported | 2026-05-27 |
| 3 | RankAgent | 62.14% | — | Imported | 2026-05-27 |
| 4 | o4-mini-high + DP | 61.69% | — | Imported | 2026-05-27 |
| 5 | o4-mini + DP | 60.75% | — | Imported | 2026-05-27 |
| 6 | GPT-5 + DP | 59.94% | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-27 |
| 7 | o3-mini + DP | 59.9% | — | Imported | 2026-05-27 |
| 8 | Grok4 + DP | 57.8% | — | Imported | 2026-05-27 |
| 9 | Gemini-2.5-Pro + DP | 57.18% | — | Imported | 2026-05-27 |
| 10 | Deepseek-R1 + DP | 56.31% | — | Imported | 2026-05-27 |
| 11 | Claude4-Sonnet + DP | 54.75% | — | Imported | 2026-05-27 |
| 12 | Llama-4-Maverick-17B-128E-Instruct + TCoT | 52.73% | — | Imported | 2026-05-27 |
| 13 | Qwen3-32B | 52.45% | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-27 |
| 14 | GPT-4o + TCoT | 51.96% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 15 | GPT-4-Turbo + TCoT | 51.5% | GPT-4 Turbo (`openai-gpt-4-turbo`) | Imported | 2026-05-27 |
| 16 | Deepseek-Chat-V3 + TCoT | 50.56% | — | Imported | 2026-05-27 |
| 17 | Llama-3.1-405B-Instruct + TCoT | 48.87% | — | Imported | 2026-05-27 |
| 18 | Qwen2.5-72B-Instruct + TCoT | 48.79% | — | Imported | 2026-05-27 |
| 19 | Llama-4-Scout-17B-16E-Instruct + TCoT | 46.53% | — | Imported | 2026-05-27 |
| 20 | Qwen2.5-Coder-32B-Instruct + TCoT | 45.51% | — | Imported | 2026-05-27 |
| 21 | QWQ-32B + DP | 43.87% | — | Imported | 2026-05-27 |
| 22 | Llama3.1-70B-Instruct + TCoT | 41.05% | — | Imported | 2026-05-27 |
| 23 | TableGPT2-7B + TCoT | 41.05% | — | Imported | 2026-05-27 |
| 24 | Llama3-70B-Chat + TCoT | 38.68% | — | Imported | 2026-05-27 |
| 25 | GPT-3.5-Turbo + PoT | 37.15% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-27 |
| 26 | Qwen2.5-Coder-7B-Instruct + TCoT | 35.12% | — | Imported | 2026-05-27 |
| 27 | TableLLM-Qwen2-7B + TCoT | 31.9% | — | Imported | 2026-05-27 |
| 28 | TableLLM-Llama3.1-8B + TCoT | 30.77% | — | Imported | 2026-05-27 |
| 29 | TableLLM-DeepseekCoder-7B + TCoT | 30.51% | — | Imported | 2026-05-27 |
| 30 | TableLLM-Llama3-8B + TCoT | 29.8% | — | Imported | 2026-05-27 |
| 31 | TableLLM-CodeQwen-7B + TCoT | 24.81% | — | Imported | 2026-05-27 |
| 32 | Llama3-8B-Chat + SCoT | 22.2% | — | Imported | 2026-05-27 |
| 33 | Qwen2.5-7B-Instruct + TCoT | 22.14% | — | Imported | 2026-05-27 |
| 34 | Mixtral-8x7B-Instruct + PoT | 21.7% | — | Imported | 2026-05-27 |
| 35 | Llama3.1-8B-Instruct + DP | 15.42% | — | Imported | 2026-05-27 |
| 36 | Mistral-7B-Instruct + SCoT | 10.97% | — | Imported | 2026-05-27 |
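Most Subject entries in the table combine a model name with a prompting-method suffix (for example `+ DP` or `+ TCoT`), and each score is a percentage string. The following is a minimal sketch of how such rows could be split into fields for further analysis, assuming the leaderboard is available as markdown text like the rows above. The helper name `parse_row` and the output field names are hypothetical and not part of any TableBench tooling.

```python
def parse_row(line: str) -> dict:
    """Parse one markdown leaderboard row into a dict (hypothetical helper)."""
    # Drop the outer pipes, then split the remaining text into cells.
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    rank, subject, overall = cells[0], cells[1], cells[2]
    # Assumption: a method suffix, when present, is separated by " + ".
    model, sep, method = subject.partition(" + ")
    return {
        "rank": int(rank),
        "model": model,
        "method": method if sep else None,
        "overall": float(overall.rstrip("%")),
    }


# Three rows copied from the table above.
rows = [
    "| 6 | GPT-5 + DP | 59.94% | GPT-5 openai-gpt-5 | Imported | 2026-05-27 |",
    "| 13 | Qwen3-32B | 52.45% | Qwen3 32B qwen-qwen3-32b | Imported | 2026-05-27 |",
    "| 14 | GPT-4o + TCoT | 51.96% | GPT-4o openai-gpt-4o | Imported | 2026-05-27 |",
]
parsed = [parse_row(r) for r in rows]

# Difference in Overall score (percentage points) between the first and last sample rows.
gap = round(parsed[0]["overall"] - parsed[-1]["overall"], 2)
print(parsed[0]["model"], parsed[0]["method"], gap)
```

Rows without a suffix, such as `Qwen3-32B`, yield `method = None` in this sketch, so they can be told apart from rows that name a prompting method.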