ConStory-Bench

A long-form story generation benchmark that measures cross-scene consistency errors, reported as Consistency Error Density (CED), over 2,000 generated stories.

Rows: 33
Primary metric: CED
Sampled: 2026-05-28

Metadata

Metrics

Consistency Error Density (lower is better), Average Words (lower is better), Total Stories
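
As a rough illustration of what a density-style metric like CED could look like, the sketch below divides a story's detected consistency errors by its word count and scales the result. This page does not state the exact normalization, so the `per_words` default of 10,000 is an assumption, and the function name and example numbers are hypothetical.

```python
def consistency_error_density(error_count: int, word_count: int,
                              per_words: int = 10_000) -> float:
    """Return detected consistency errors per `per_words` words.

    ASSUMPTION: the 10,000-word normalization is illustrative only;
    this page does not state how the benchmark scales CED.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    return error_count / word_count * per_words


# Hypothetical story: 3 detected errors in an 8,000-word story.
example_ced = consistency_error_density(3, 8_000)
print(f"CED = {example_ced:.3f}")
```

Normalizing by length keeps a longer story from being penalized simply for having more text in which errors can occur, which is presumably why Average Words is reported alongside CED.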

Latest Results

Rows are imported from the official ConStory-Bench GitHub README leaderboard. CED is lower-is-better.

| Rank | Subject | CED | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5-Reasoning | 0.113 | GPT-5 (openai-gpt-5) | Imported | 2026-05-28 |
| 2 | Gemini-2.5-Pro | 0.302 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28 |
| 3 | Gemini-2.5-Flash | 0.305 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-28 |
| 4 | Claude-Sonnet-4.5 | 0.52 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28 |
| 5 | GLM-4.6 | 0.528 | GLM 4.6 (z-ai-glm-4.6) | Imported | 2026-05-28 |
| 6 | Qwen3-32B | 0.537 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-28 |
| 7 | Ring-1T | 0.539 | | Imported | 2026-05-28 |
| 8 | DeepSeek-V3.2-Exp | 0.541 | DeepSeek V3.2 Exp (deepseek-deepseek-v3.2-exp) | Imported | 2026-05-28 |
| 9 | Qwen3-235B-A22B-Thinking | 0.559 | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Imported | 2026-05-28 |
| 10 | GLM-4.5 | 0.595 | GLM 4.5 (z-ai-glm-4.5) | Imported | 2026-05-28 |
| 11 | LongWriter-Zero-32B | 0.669 | | Imported | 2026-05-28 |
| 12 | Grok-4 | 0.67 | Grok 4 (x-ai-grok-4) | Imported | 2026-05-28 |
| 13 | SuperWriter | 0.674 | | Imported | 2026-05-28 |
| 14 | Ling-1T | 0.699 | | Imported | 2026-05-28 |
| 15 | GPT-4o-1120 | 0.711 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 16 | Step3 | 0.845 | | Imported | 2026-05-28 |
| 17 | Qwen3-Next-80B-Thinking | 0.959 | | Imported | 2026-05-28 |
| 18 | DOME | 1.033 | | Imported | 2026-05-28 |
| 19 | Doubao-1.6-Thinking-2507 | 1.217 | | Imported | 2026-05-28 |
| 20 | Kimi-K2-2509 | 1.3 | | Imported | 2026-05-28 |
| 21 | Kimi-K2-2507 | 1.33 | | Imported | 2026-05-28 |
| 22 | Mistral-Medium-3.1 | 1.355 | Mistral: Mistral Medium 3.1 (mistralai-mistral-medium-3.1) | Imported | 2026-05-28 |
| 23 | Qwen3-235B-A22B | 1.447 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-28 |
| 24 | Qwen3-Next-80B | 1.603 | | Imported | 2026-05-28 |
| 25 | Qwen3-4B-Instruct-2507 | 1.685 | | Imported | 2026-05-28 |
| 26 | Nvidia-Llama-3.1-Ultra | 1.833 | | Imported | 2026-05-28 |
| 27 | Qwen3-30B-A3B-Instruct-2507 | 2.13 | Qwen3 30B A3B Instruct 2507 (qwen-qwen3-30b-a3b-instruct-2507) | Imported | 2026-05-28 |
| 28 | DeepSeek-V3 | 2.422 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-28 |
| 29 | Suri-ORPO | 2.445 | | Imported | 2026-05-28 |
| 30 | QwenLong-L1-32B | 3.413 | | Imported | 2026-05-28 |
| 31 | DeepSeek-R1 | 3.419 | R1 (deepseek-r1) | Imported | 2026-05-28 |
| 32 | MiniMax-M1-80k | 3.447 | | Imported | 2026-05-28 |
| 33 | LongAlign-13B | 3.664 | | Imported | 2026-05-28 |
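
Because CED is lower-is-better, the rank order in the results above is an ascending sort on the CED value. A minimal sketch of that ordering follows, using four rows copied from the leaderboard; the `Row` type and `rank_by_ced` helper are hypothetical names for illustration, not part of any ConStory-Bench tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    subject: str
    ced: float


# A few leaderboard rows, deliberately listed out of order.
rows = [
    Row("Gemini-2.5-Pro", 0.302),
    Row("Claude-Sonnet-4.5", 0.52),
    Row("GPT-5-Reasoning", 0.113),
    Row("Gemini-2.5-Flash", 0.305),
]


def rank_by_ced(rows: list[Row]) -> list[tuple[int, str, float]]:
    """Sort ascending by CED (lower is better) and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda r: r.ced)
    return [(rank, r.subject, r.ced) for rank, r in enumerate(ordered, start=1)]


ranked = rank_by_ced(rows)
for rank, subject, ced in ranked:
    print(rank, subject, ced)
```

This sketch does not handle ties; rows with equal CED would keep their input order, since Python's `sorted` is stable.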