ArtifactsBench

Benchmark for automated multimodal evaluation of visual and interactive artifact generation from code, using rendered artifacts and checklist-guided MLLM judging over 1,825 diverse tasks.

6 rows
Primary metric: avg_score
Sampled: 2026-05-06

Metadata

Metrics

AVG

Latest Results

Rows are parsed from the ArtifactsBench README current-results table. The README states these Version 1.2 results are scored by Gemini-2.5-Pro.

Rank  Subject              AVG    Model Match                                                         Provenance  Sampled
1     GPT-5                72.55  GPT-5 (openai-gpt-5)                                                Imported    2026-05-06
2     Claude Opus 4.1      59.76  Claude Opus 4.1 (anthropic-claude-opus-4.1)                         Imported    2026-05-06
3     Gemini-2.5-Pro       57.74  Gemini 2.5 Pro (google-gemini-2.5-pro)                              Imported    2026-05-06
4     GPT-OSS-120B         57.69  gpt-oss-120b (openai-gpt-oss-120b)                                  Imported    2026-05-06
5     Claude Sonnet 4      57.28  Claude Sonnet 4 (anthropic-claude-sonnet-4)                         Imported    2026-05-06
6     Qwen3-235B-Thinking  55.01  Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507)  Imported    2026-05-06
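
The note above says the rows are parsed from the README's current-results table. The exact README layout and the importer's code are not shown here, so the following is only a minimal sketch of how such a parse might look, assuming the table is a pipe-delimited Markdown table with hypothetical `Model` and `AVG` columns. The function names `parse_results_table` and `rank_by_avg` and the `SAMPLE_README` text are illustrative, not part of ArtifactsBench; the two sample scores are taken from the table above.

```python
import re


def parse_results_table(markdown: str) -> list[dict]:
    """Parse the first pipe-delimited Markdown table found in `markdown`.

    Assumption: the first table row is the header and the README uses
    standard `| a | b |` rows. Returns one dict per data row.
    """
    header = None
    rows = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # first non-table line after the table ends it
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(header, cells)))
    return rows


def rank_by_avg(rows: list[dict], score_key: str = "AVG") -> list[dict]:
    """Sort rows by score (descending) and attach a 1-based rank."""
    ranked = sorted(rows, key=lambda r: float(r[score_key]), reverse=True)
    return [{"rank": i, **r} for i, r in enumerate(ranked, start=1)]


# Hypothetical README excerpt; the column names are assumptions.
SAMPLE_README = """
Some intro text.

| Model | AVG |
|-------|-----|
| Claude Opus 4.1 | 59.76 |
| GPT-5 | 72.55 |

Trailing text.
"""

rows = rank_by_avg(parse_results_table(SAMPLE_README))
for row in rows:
    print(row["rank"], row["Model"], row["AVG"])
```

Scores are kept as the original strings and converted to `float` only for sorting, so the imported values match the README text exactly.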