Fiction.LiveBench

Fiction comprehension and reasoning benchmark for assessing model understanding over narrative text.

22 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better); standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

Rank | Subject | Score | Model match | Provenance | Sampled
1 | o3 | 100.00 | o3 (openai-o3) | Imported | 2026-05-06
2 | Grok 4 | 96.90 | Grok 4 (x-ai-grok-4) | Imported | 2026-05-06
3 | GPT-5.2 | 96.90 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
4 | Gemini 2.5 Pro (Jun 2025) | 90.60 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06
5 | Qwen 3 235B | 68.80 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-06
6 | chatgpt-4o-01-29-2025 | 65.60 | n/a | Imported | 2026-05-06
7 | Grok-3 mini | 65.60 | Grok 3 Mini (x-ai-grok-3-mini) | Imported | 2026-05-06
8 | GPT-4.1 | 63.90 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06
9 | o4-mini-2025-04-16 medium | 62.50 | o4 Mini (openai-o4-mini) | Imported | 2026-05-06
10 | Minimax M2 | 59.40 | MiniMax M2 (minimax-minimax-m2) | Imported | 2026-05-06
11 | o1 | 53.10 | o1 (openai-o1) | Imported | 2026-05-06
12 | DeepSeek V3 | 53.10 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-06
13 | Claude 3.7 Sonnet | 53.10 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06
14 | GPT-4.1 mini | 46.90 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-06
15 | Qwen3-235B-A22B | 44.40 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-06
16 | Kimi K2 Instruct | 40.60 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-06
17 | Claude Opus 4.5 | 37.50 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06
18 | Gemini 2.0 Pro Exp (Feb 2025) | 37.50 | n/a | Imported | 2026-05-06
19 | Gemini 2.0 Flash Thinking Exp | 37.50 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-06
20 | Llama-4-Maverick-17B-128E-Instruct | 36.40 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-06
21 | DeepSeek R1 | 33.30 | R1 (deepseek-r1) | Imported | 2026-05-06
22 | Llama 4 Scout | 27.30 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-06