FinanceBench
FinanceBench evaluates language models on financial analysis questions with source documents, gold answers, and human-annotated model completions.
16 rows
Primary metric: accuracy
Sampled: 2026-05-06
Metadata
Metrics
Accuracy, Answered Accuracy, Total Count, Correct Count, Incorrect Count (lower is better), Refusal Count (lower is better)
| Rank | Subject | Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4-1106-preview (oracle_reverse) | 89.33 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 2 | gpt-4-1106-preview (oracle) | 85.33 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 3 | gpt-4 (oracle) | 84 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 4 | gpt-4 (oracle_reverse) | 78.67 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 5 | gpt-4-1106-preview (inContext_reverse) | 78.67 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 6 | claude-2 (inContext_reverse) | 76 | — | Imported | 2026-05-06 |
| 7 | gpt-4-1106-preview (singleStore) | 50 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 8 | gpt-4 (singleStore) | 42 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 9 | llama2 (singleStore) | 41.33 | — | Imported | 2026-05-06 |
| 10 | claude-2 (inContext) | 37.33 | — | Imported | 2026-05-06 |
| 11 | gpt-4-1106-preview (inContext) | 24.67 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 12 | gpt-4-1106-preview (sharedStore) | 19.33 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 13 | llama2 (sharedStore) | 19.33 | — | Imported | 2026-05-06 |
| 14 | gpt-4 (sharedStore) | 16.67 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 15 | gpt-4-1106-preview (closedBook) | 9.33 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 16 | gpt-4 (closedBook) | 4.67 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
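
The page lists the metrics by name but does not define them. The following is a minimal sketch of how they might relate, assuming Accuracy is correct answers over all questions and Answered Accuracy leaves refusals out of the denominator. Both definitions are assumptions, and the `Counts` class, the function names, and the example tallies are hypothetical and not taken from the leaderboard.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-configuration tallies; field names are illustrative."""
    correct: int
    incorrect: int
    refusal: int

    @property
    def total(self) -> int:
        # Assumed: every question is either correct, incorrect, or refused.
        return self.correct + self.incorrect + self.refusal


def accuracy(c: Counts) -> float:
    """Assumed definition: correct answers over all questions, in percent."""
    return 100.0 * c.correct / c.total if c.total else 0.0


def answered_accuracy(c: Counts) -> float:
    """Assumed definition: correct over answered (non-refused) questions, in percent."""
    answered = c.correct + c.incorrect
    return 100.0 * c.correct / answered if answered else 0.0


# Made-up counts for illustration only; they are not leaderboard data.
example = Counts(correct=60, incorrect=20, refusal=20)
print(round(accuracy(example), 2))           # 60.0
print(round(answered_accuracy(example), 2))  # 75.0
```

Under these assumed definitions, a configuration that refuses often can show a low Accuracy alongside a noticeably higher Answered Accuracy, so the two numbers are best read together with the Refusal Count.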