FinanceBench

FinanceBench evaluates language models on financial analysis questions with source documents, gold answers, and human-annotated model completions.

Rows: 16
Primary metric: accuracy
Sampled: 2026-05-06

Metadata

Metrics

Accuracy, Answered Accuracy, Total Count, Correct Count, Incorrect Count (lower is better), Refusal Count (lower is better)

Latest Results

Each row aggregates one public FinanceBench result JSONL file, grouped by the source model_name and eval_mode. Accuracy is the percentage of Correct Answer labels over all 150 questions; answered_accuracy is computed the same way but excludes questions labeled Refusal.
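The aggregation described above can be sketched as follows. This is a minimal illustration, not the pipeline that produced these results: the `label` field name and the exact label strings ("Correct Answer", "Incorrect Answer", "Refusal") are assumptions, and the `aggregate` helper is hypothetical. Only `model_name` and `eval_mode` come from the description above.

```python
import json
from collections import Counter, defaultdict

# Assumed label strings; the real JSONL files may spell these differently.
CORRECT, INCORRECT, REFUSAL = "Correct Answer", "Incorrect Answer", "Refusal"


def aggregate(jsonl_lines):
    """Group records by (model_name, eval_mode) and compute the listed metrics."""
    counts = defaultdict(Counter)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # "label" is a hypothetical field name for the human annotation.
        counts[(rec["model_name"], rec["eval_mode"])][rec["label"]] += 1

    rows = []
    for (model, mode), c in counts.items():
        total = sum(c.values())
        correct = c[CORRECT]
        answered = total - c[REFUSAL]  # denominator without refusals
        rows.append({
            "subject": f"{model} ({mode})",
            "total_count": total,
            "correct_count": correct,
            "incorrect_count": c[INCORRECT],
            "refusal_count": c[REFUSAL],
            "accuracy": round(100 * correct / total, 2),
            "answered_accuracy": (
                round(100 * correct / answered, 2) if answered else None
            ),
        })
    # Rank by the primary metric, highest first.
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)
```

Under these assumptions, a file with 150 records of which 134 are labeled correct yields an accuracy of 89.33, and any refusals in that file would push answered_accuracy above that figure because they are dropped from the denominator.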

| Rank | Subject | Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-4-1106-preview (oracle_reverse) | 89.33 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 2 | gpt-4-1106-preview (oracle) | 85.33 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 3 | gpt-4 (oracle) | 84 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 4 | gpt-4 (oracle_reverse) | 78.67 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 5 | gpt-4-1106-preview (inContext_reverse) | 78.67 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 6 | claude-2 (inContext_reverse) | 76 | | Imported | 2026-05-06 |
| 7 | gpt-4-1106-preview (singleStore) | 50 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 8 | gpt-4 (singleStore) | 42 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 9 | llama2 (singleStore) | 41.33 | | Imported | 2026-05-06 |
| 10 | claude-2 (inContext) | 37.33 | | Imported | 2026-05-06 |
| 11 | gpt-4-1106-preview (inContext) | 24.67 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 12 | gpt-4-1106-preview (sharedStore) | 19.33 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 13 | llama2 (sharedStore) | 19.33 | | Imported | 2026-05-06 |
| 14 | gpt-4 (sharedStore) | 16.67 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 15 | gpt-4-1106-preview (closedBook) | 9.33 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 16 | gpt-4 (closedBook) | 4.67 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |