CharXiv-D

CharXiv-D is the descriptive-question subset of the CharXiv benchmark, designed to assess how well multimodal large language models extract basic information from scientific charts. Its questions cover information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.

Rows: 13
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | Qwen3 VL 32B Instruct | 0.91 | Qwen3 VL 32B Instruct (qwen-qwen3-vl-32b-instruct) | Self-reported | 2026-05-06 |
| 2 | Qwen3 VL 32B Thinking | 0.90 | | Self-reported | 2026-05-06 |
| 3 | GPT-4.5 | 0.90 | GPT-4.5 (openai-gpt-4.5-preview) | Self-reported | 2026-05-06 |
| 4 | GPT-4.1 mini | 0.88 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Self-reported | 2026-05-06 |
| 5 | GPT-4.1 | 0.88 | GPT-4.1 (openai-gpt-4.1) | Self-reported | 2026-05-06 |
| 6 | Qwen3 VL 30B A3B Thinking | 0.87 | Qwen3 VL 30B A3B Thinking (qwen-qwen3-vl-30b-a3b-thinking) | Self-reported | 2026-05-06 |
| 7 | Qwen3 VL 8B Thinking | 0.86 | Qwen3 VL 8B Thinking (qwen-qwen3-vl-8b-thinking) | Self-reported | 2026-05-06 |
| 8 | Qwen3 VL 30B A3B Instruct | 0.85 | Qwen3 VL 30B A3B Instruct (qwen-qwen3-vl-30b-a3b-instruct) | Self-reported | 2026-05-06 |
| 9 | GPT-4o | 0.85 | GPT-4o (2024-08-06) (openai-gpt-4o-2024-08-06) | Self-reported | 2026-05-06 |
| 10 | Qwen3 VL 4B Thinking | 0.84 | | Self-reported | 2026-05-06 |
| 11 | Qwen3 VL 8B Instruct | 0.83 | Qwen3 VL 8B Instruct (qwen-qwen3-vl-8b-instruct) | Self-reported | 2026-05-06 |
| 12 | Qwen3 VL 4B Instruct | 0.76 | | Self-reported | 2026-05-06 |
| 13 | GPT-4.1 nano | 0.74 | GPT-4.1 Nano (openai-gpt-4.1-nano) | Self-reported | 2026-05-06 |