CharXiv-D
CharXiv-D is the descriptive-question subset of the CharXiv benchmark. It assesses how well multimodal large language models extract basic information from scientific charts. Its questions cover information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.
13 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3 VL 32B Instruct | 0.91 | Qwen3 VL 32B Instruct (`qwen-qwen3-vl-32b-instruct`) | Self-reported | 2026-05-06 |
| 2 | Qwen3 VL 32B Thinking | 0.90 | — | Self-reported | 2026-05-06 |
| 3 | GPT-4.5 | 0.90 | GPT-4.5 (`openai-gpt-4.5-preview`) | Self-reported | 2026-05-06 |
| 4 | GPT-4.1 mini | 0.88 | GPT-4.1 Mini (`openai-gpt-4.1-mini`) | Self-reported | 2026-05-06 |
| 5 | GPT-4.1 | 0.88 | GPT-4.1 (`openai-gpt-4.1`) | Self-reported | 2026-05-06 |
| 6 | Qwen3 VL 30B A3B Thinking | 0.87 | Qwen3 VL 30B A3B Thinking (`qwen-qwen3-vl-30b-a3b-thinking`) | Self-reported | 2026-05-06 |
| 7 | Qwen3 VL 8B Thinking | 0.86 | Qwen3 VL 8B Thinking (`qwen-qwen3-vl-8b-thinking`) | Self-reported | 2026-05-06 |
| 8 | Qwen3 VL 30B A3B Instruct | 0.85 | Qwen3 VL 30B A3B Instruct (`qwen-qwen3-vl-30b-a3b-instruct`) | Self-reported | 2026-05-06 |
| 9 | GPT-4o | 0.85 | GPT-4o (2024-08-06) (`openai-gpt-4o-2024-08-06`) | Self-reported | 2026-05-06 |
| 10 | Qwen3 VL 4B Thinking | 0.84 | — | Self-reported | 2026-05-06 |
| 11 | Qwen3 VL 8B Instruct | 0.83 | Qwen3 VL 8B Instruct (`qwen-qwen3-vl-8b-instruct`) | Self-reported | 2026-05-06 |
| 12 | Qwen3 VL 4B Instruct | 0.76 | — | Self-reported | 2026-05-06 |
| 13 | GPT-4.1 nano | 0.74 | GPT-4.1 Nano (`openai-gpt-4.1-nano`) | Self-reported | 2026-05-06 |
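
Several rows in the table above share a score (for example 0.90, 0.88, and 0.85) yet receive consecutive ranks. The page does not say how ties are ordered, so the sketch below is only an illustration: it copies the Subject and Score columns from the table and recomputes ranks under an assumed "competition" scheme in which tied scores share a rank. The names `ROWS` and `competition_ranks` are hypothetical and are not part of the leaderboard.

```python
# Subject and Score columns copied from the table above.
ROWS = [
    ("Qwen3 VL 32B Instruct", 0.91),
    ("Qwen3 VL 32B Thinking", 0.90),
    ("GPT-4.5", 0.90),
    ("GPT-4.1 mini", 0.88),
    ("GPT-4.1", 0.88),
    ("Qwen3 VL 30B A3B Thinking", 0.87),
    ("Qwen3 VL 8B Thinking", 0.86),
    ("Qwen3 VL 30B A3B Instruct", 0.85),
    ("GPT-4o", 0.85),
    ("Qwen3 VL 4B Thinking", 0.84),
    ("Qwen3 VL 8B Instruct", 0.83),
    ("Qwen3 VL 4B Instruct", 0.76),
    ("GPT-4.1 nano", 0.74),
]


def competition_ranks(rows):
    """Return (rank, subject, score) tuples where tied scores share a rank.

    Assumption: ties use competition ranking (1, 2, 2, 4, ...). The
    leaderboard itself numbers tied rows consecutively.
    """
    # sorted() is stable, so tied rows keep their order from the table.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, subject, score))
    return ranked


if __name__ == "__main__":
    for rank, subject, score in competition_ranks(ROWS):
        print(f"{rank:>2}  {subject:<28} {score:.2f}")
```

Because every score is self-reported and rounded to two decimals, differences of 0.01 between adjacent rows are probably best read as rough ordering rather than a strict ranking.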