ArtifactsBench
Benchmark for automated multimodal evaluation of code-generated visual and interactive artifacts. Rendered artifacts are scored by a checklist-guided multimodal LLM (MLLM) judge across 1,825 diverse tasks.
6 rows
Primary metric: avg_score
Sampled: 2026-05-06
| Rank | Subject | AVG | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 | 72.55 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 2 | Claude Opus 4.1 | 59.76 | Claude Opus 4.1 (`anthropic-claude-opus-4.1`) | Imported | 2026-05-06 |
| 3 | Gemini-2.5-Pro | 57.74 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 4 | GPT-OSS-120B | 57.69 | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-06 |
| 5 | Claude Sonnet 4 | 57.28 | Claude Sonnet 4 (`anthropic-claude-sonnet-4`) | Imported | 2026-05-06 |
| 6 | Qwen3-235B-Thinking | 55.01 | Qwen3 235B A22B Thinking 2507 (`qwen-qwen3-235b-a22b-thinking-2507`) | Imported | 2026-05-06 |