StructEval

Structured-output benchmark evaluating text and visual structured generation and conversion across 18 formats and 2,035 examples.

Rows: 12
Primary metric: average
Sampled: 2026-05-28

Metadata

Metrics

Average, StructEval-T Generation, StructEval-T Conversion, StructEval-V Generation, StructEval-V Conversion

Latest Results

Rows are imported from the official StructEval static site JavaScript leaderboardData array and sorted by average score.

| Rank | Subject | Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o | 76.02% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 2 | GPT-4.1-mini | 75.64% | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-28 |
| 3 | o1-mini | 75.58% | | Imported | 2026-05-28 |
| 4 | GPT-4o-mini | 73.19% | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-28 |
| 5 | Gemini-1.5-pro | 71.75% | | Imported | 2026-05-28 |
| 6 | Qwen3-4B | 67.04% | | Imported | 2026-05-28 |
| 7 | Gemini-2.0-flash | 62.55% | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-28 |
| 8 | Llama-3.1-8B-Instruct | 61.77% | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-28 |
| 9 | Qwen2.5-7B-Instruct | 59.03% | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Imported | 2026-05-28 |
| 10 | Phi-4-mini-instruct | 56.97% | | Imported | 2026-05-28 |
| 11 | Meta-Llama-3-8B-Instruct | 51.59% | Llama 3 8B Instruct (meta-llama-llama-3-8b-instruct) | Imported | 2026-05-28 |
| 12 | Phi-3-mini-128k-instruct | 40.79% | | Imported | 2026-05-28 |
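
The import step described under Latest Results (reading the leaderboardData array and sorting by average score) might look roughly like the sketch below. This is an illustration, not the site's actual importer. The JavaScript excerpt, the field names `model` and `average`, and the helpers `extract_rows` and `rank_rows` are all hypothetical. It also assumes the array literal happens to be valid JSON; a real JavaScript object literal with unquoted keys would need a more tolerant parser.

```python
import json
import re

# Hypothetical excerpt of the site's JavaScript. The real file's layout and
# the field names ("model", "average") are assumptions for illustration.
JS_SOURCE = """
const leaderboardData = [
  {"model": "GPT-4o-mini", "average": 73.19},
  {"model": "GPT-4o", "average": 76.02},
  {"model": "GPT-4.1-mini", "average": 75.64}
];
"""


def extract_rows(js_source: str) -> list:
    """Find the leaderboardData array literal and parse it as JSON."""
    match = re.search(
        r"leaderboardData\s*=\s*(\[.*?\])\s*;", js_source, re.DOTALL
    )
    if match is None:
        raise ValueError("leaderboardData array not found")
    return json.loads(match.group(1))


def rank_rows(rows: list) -> list:
    """Sort rows by average score (highest first) and attach a 1-based rank."""
    ordered = sorted(rows, key=lambda row: row["average"], reverse=True)
    return [{"rank": i, **row} for i, row in enumerate(ordered, start=1)]


ranked = rank_rows(extract_rows(JS_SOURCE))
for row in ranked:
    print(f'{row["rank"]} {row["model"]} {row["average"]:.2f}%')
```

Because Python's `sorted` is stable, rows with equal averages would keep their original array order. Any other tie-breaking rule would have to be added explicitly.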