OpenHuEval
OpenHuEval evaluates large language models on Hungarian-specific tasks, including real user queries, self-awareness, proverb reasoning, generative evaluation, and fill-in-the-blank tasks.
10 rows
Primary metric: macro_average
Sampled: 2026-05-06
Metadata
Metrics
Macro Average (computed), HuWildBench WBScore, HuSimpleQA Accuracy, HuProverbRea Open-ended Accuracy, HuProverbRea 2CQ Accuracy, HuMatchingFIB Blank Accuracy, HuMatchingFIB Question Accuracy, HuStandardFIB Blank Accuracy, HuStandardFIB Question Accuracy
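The page does not state how the computed Macro Average is derived. A minimal sketch, assuming it is the unweighted mean of the eight benchmark metrics listed above, rounded to two decimals, might look like the following. The function name `macro_average` and the scores in `example_scores` are hypothetical placeholders, not values from the leaderboard.

```python
from statistics import mean

# The eight per-task metrics named on this page (the macro average itself excluded).
METRICS = [
    "HuWildBench WBScore",
    "HuSimpleQA Accuracy",
    "HuProverbRea Open-ended Accuracy",
    "HuProverbRea 2CQ Accuracy",
    "HuMatchingFIB Blank Accuracy",
    "HuMatchingFIB Question Accuracy",
    "HuStandardFIB Blank Accuracy",
    "HuStandardFIB Question Accuracy",
]


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over METRICS (assumed definition), rounded to 2 decimals."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        # Refuse to average over a partial set, which would skew the result.
        raise ValueError(f"missing metric scores: {missing}")
    return round(mean(scores[m] for m in METRICS), 2)


# Hypothetical placeholder scores, for illustration only.
example_scores = {m: 50.0 for m in METRICS}
example_scores["HuSimpleQA Accuracy"] = 58.0

print(macro_average(example_scores))
```

Because every metric carries equal weight under this assumption, a model that is strong on one task but weak on the others is not rewarded disproportionately. If the leaderboard weights or excludes any metric, the sketch would need to change accordingly.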
| Rank | Subject | Macro Average (computed) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4o | 63.77 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-06 |
| 2 | Deepseek-R1 | 62.31 | R1 (`deepseek-r1`) | Imported | 2026-05-06 |
| 3 | Deepseek-V3 | 57.10 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-06 |
| 4 | Llama-3.1-Instruct-70B | 50.41 | — | Imported | 2026-05-06 |
| 5 | o1-mini | 49.38 | — | Imported | 2026-05-06 |
| 6 | GPT-4o-mini | 49.33 | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-06 |
| 7 | Qwen2.5-Instruct-72B | 48.22 | — | Imported | 2026-05-06 |
| 8 | QwQ | 34.47 | — | Imported | 2026-05-06 |
| 9 | Llama-3.1-Instruct-8B | 28.73 | — | Imported | 2026-05-06 |
| 10 | Qwen2.5-Instruct-7B | 25.64 | — | Imported | 2026-05-06 |