CharXiv-R

CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across visual chart elements. It evaluates multimodal large language models on their ability to understand and reason about scientific charts from arXiv papers through various reasoning tasks.

39 rows
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Normalized Score

Showing 3 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 | 90.1% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
2 | Claude Opus 4.8 | 89.9% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28

1 | Claude Mythos Preview | 0.93 | Claude Mythos Preview (anthropic-claude-mythos-preview) | Self-reported | 2026-05-06
2 | Claude Opus 4.7 | 0.91 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-06
3 | Kimi K2.6 | 0.87 | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Self-reported | 2026-05-06
4 | Muse Spark | 0.86 | - | Self-reported | 2026-05-06
5 | GPT-5.2 | 0.82 | GPT-5.2 (openai-gpt-5.2) | Self-reported | 2026-05-06
6 | Qwen3.6 Plus | 0.81 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-06
7 | Gemini 3 Pro | 0.81 | Gemini 3 (google-gemini-3) | Self-reported | 2026-05-06
8 | GPT-5 | 0.81 | GPT-5 (openai-gpt-5) | Self-reported | 2026-05-06
9 | Gemini 3 Flash | 0.80 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Self-reported | 2026-05-06
10 | Qwen3.5-27B | 0.80 | Qwen3.5-27B (qwen-qwen3.5-27b) | Self-reported | 2026-05-06
11 | o3 | 0.79 | o3 (openai-o3) | Self-reported | 2026-05-06
12 | Qwen3.6-27B | 0.78 | Qwen3.6 27B (qwen-qwen3.6-27b) | Self-reported | 2026-05-06
13 | Qwen3.6-35B-A3B | 0.78 | Qwen3.6 35B A3B (qwen-qwen3.6-35b-a3b) | Self-reported | 2026-05-06
14 | Qwen3.5-35B-A3B | 0.78 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Self-reported | 2026-05-06
14 | Kimi K2.5 | 0.78 | KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Self-reported | 2026-05-06
16 | Claude Opus 4.6 | 0.77 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-06
17 | Qwen3.5-122B-A10B | 0.77 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Self-reported | 2026-05-06
18 | Gemini 3.1 Flash-Lite | 0.73 | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Self-reported | 2026-05-06
19 | o4-mini | 0.72 | o4 Mini (openai-o4-mini) | Self-reported | 2026-05-06
20 | Qwen3 VL 235B A22B Thinking | 0.66 | Qwen3 VL 235B A22B Thinking (qwen-qwen3-vl-235b-a22b-thinking) | Self-reported | 2026-05-06
21 | Qwen3 VL 32B Thinking | 0.65 | - | Self-reported | 2026-05-06
22 | Qwen3 VL 32B Instruct | 0.63 | Qwen3 VL 32B Instruct (qwen-qwen3-vl-32b-instruct) | Self-reported | 2026-05-06
23 | Qwen3 VL 235B A22B Instruct | 0.62 | Qwen3 VL 235B A22B Instruct (qwen-qwen3-vl-235b-a22b-instruct) | Self-reported | 2026-05-06
24 | GPT-4o | 0.59 | GPT-4o (2024-08-06) (openai-gpt-4o-2024-08-06) | Self-reported | 2026-05-06
25 | GPT-4.1 mini | 0.57 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Self-reported | 2026-05-06
26 | GPT-4.1 | 0.57 | GPT-4.1 (openai-gpt-4.1) | Self-reported | 2026-05-06
27 | Qwen3 VL 30B A3B Thinking | 0.57 | Qwen3 VL 30B A3B Thinking (qwen-qwen3-vl-30b-a3b-thinking) | Self-reported | 2026-05-06
28 | GPT-4.5 | 0.55 | GPT-4.5 (openai-gpt-4.5-preview) | Self-reported | 2026-05-06
29 | Qwen3 VL 8B Thinking | 0.53 | Qwen3 VL 8B Thinking (qwen-qwen3-vl-8b-thinking) | Self-reported | 2026-05-06
30 | Qwen3 VL 4B Thinking | 0.50 | - | Self-reported | 2026-05-06
31 | Qwen3 VL 30B A3B Instruct | 0.49 | Qwen3 VL 30B A3B Instruct (qwen-qwen3-vl-30b-a3b-instruct) | Self-reported | 2026-05-06
32 | Qwen3 VL 8B Instruct | 0.46 | Qwen3 VL 8B Instruct (qwen-qwen3-vl-8b-instruct) | Self-reported | 2026-05-06
33 | GPT-4.1 nano | 0.41 | GPT-4.1 Nano (openai-gpt-4.1-nano) | Self-reported | 2026-05-06
34 | Qwen3 VL 4B Instruct | 0.40 | - | Self-reported | 2026-05-06

1 | Claude Mythos Preview | 93.2% | Claude Mythos Preview (anthropic-claude-mythos-preview) | Launch post | 2026-04-16
2 | Claude Opus 4.7 | 91% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Launch post | 2026-04-16
3 | Claude Opus 4.6 | 84.7% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Launch post | 2026-04-16
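
The table above mixes two score notations: percentages (such as 90.1%) in the 2026-05-28 and 2026-04-16 slices, and fractions (such as 0.91) in the 2026-05-06 slice. The Python sketch below is one possible way to put both on a common 0 to 1 scale before comparing a subject across slices. The names `Row`, `to_fraction`, and `scores_by_subject` are hypothetical helpers introduced for illustration, and the conversion rule (divide by 100 when a percent sign is present) is an assumption, not the page's documented "Normalized Score" definition. Only a few rows are copied in as sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str
    raw_score: str  # score exactly as printed in the table
    sampled: str    # ISO date of the source slice


def to_fraction(raw_score: str) -> float:
    """Convert '90.1%' or '0.91' to a fraction (assumed rule, see lead-in)."""
    text = raw_score.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


# A handful of rows copied from the table above, not the full leaderboard.
ROWS = [
    Row("Claude Opus 4.7", "90.1%", "2026-05-28"),
    Row("Claude Opus 4.8", "89.9%", "2026-05-28"),
    Row("Claude Opus 4.7", "0.91", "2026-05-06"),
    Row("Claude Opus 4.7", "91%", "2026-04-16"),
    Row("Claude Opus 4.6", "0.77", "2026-05-06"),
    Row("Claude Opus 4.6", "84.7%", "2026-04-16"),
]


def scores_by_subject(rows):
    """Group normalized scores per subject, keyed by slice date."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row.subject, {})[row.sampled] = to_fraction(row.raw_score)
    return grouped


if __name__ == "__main__":
    for subject, by_date in scores_by_subject(ROWS).items():
        for sampled in sorted(by_date):
            print(f"{subject}  {sampled}  {by_date[sampled]:.3f}")
```

Grouping by subject and slice date makes it visible when the same model carries different values in different slices (for example, Claude Opus 4.6 at 0.77 and 84.7%). The sketch only lines the numbers up; it does not explain or reconcile such differences, which may come from different sources or evaluation settings.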