StructEval
Structured-output benchmark evaluating text and visual structured generation and conversion across 18 formats and 2,035 examples.
Rows: 12
Primary metric: Average
Sampled: 2026-05-28
Metadata
Metrics: Average, StructEval-T Generation, StructEval-T Conversion, StructEval-V Generation, StructEval-V Conversion
| Rank | Subject | Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o | 76.02% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 2 | GPT-4.1-mini | 75.64% | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-28 |
| 3 | o1-mini | 75.58% | — | Imported | 2026-05-28 |
| 4 | GPT-4o-mini | 73.19% | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-28 |
| 5 | Gemini-1.5-pro | 71.75% | — | Imported | 2026-05-28 |
| 6 | Qwen3-4B | 67.04% | — | Imported | 2026-05-28 |
| 7 | Gemini-2.0-flash | 62.55% | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-28 |
| 8 | Llama-3.1-8B-Instruct | 61.77% | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-28 |
| 9 | Qwen2.5-7B-Instruct | 59.03% | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Imported | 2026-05-28 |
| 10 | Phi-4-mini-instruct | 56.97% | — | Imported | 2026-05-28 |
| 11 | Meta-Llama-3-8B-Instruct | 51.59% | Llama 3 8B Instruct (meta-llama-llama-3-8b-instruct) | Imported | 2026-05-28 |
| 12 | Phi-3-mini-128k-instruct | 40.79% | — | Imported | 2026-05-28 |