OpenHuEval

OpenHuEval evaluates large language models on Hungarian-specific tasks, including real user queries, self-awareness, proverb reasoning, generative evaluation, and fill-in-the-blank tasks.

Rows: 10
Primary metric: macro_average
Sampled: 2026-05-06

Metadata

Metrics

Macro Average (computed), HuWildBench WBScore, HuSimpleQA Accuracy, HuProverbRea Open-ended Accuracy, HuProverbRea 2CQ Accuracy, HuMatchingFIB Blank Accuracy, HuMatchingFIB Question Accuracy, HuStandardFIB Blank Accuracy, HuStandardFIB Question Accuracy

Latest Results

Rows are parsed from the public static leaderboard table. The source does not publish a single overall score, so macro_average is computed as the unweighted mean of the eight displayed metrics and used only for ordering in this snapshot.
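The computed macro average described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the function name `macro_average`, the dictionary layout, and the rounding to two decimals (chosen to match the precision shown in the table) are assumptions, and the example scores are hypothetical placeholders, not published results.

```python
from statistics import fmean

# The eight displayed metrics, as named in the Metrics list.
METRICS = (
    "HuWildBench WBScore",
    "HuSimpleQA Accuracy",
    "HuProverbRea Open-ended Accuracy",
    "HuProverbRea 2CQ Accuracy",
    "HuMatchingFIB Blank Accuracy",
    "HuMatchingFIB Question Accuracy",
    "HuStandardFIB Blank Accuracy",
    "HuStandardFIB Question Accuracy",
)


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean of the eight metrics (hypothetical helper).

    Assumes every metric is reported on the same scale, and refuses to
    average a row with a missing metric rather than silently skipping it.
    """
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    # Rounding to two decimals is an assumption based on the table's precision.
    return round(fmean([scores[m] for m in METRICS]), 2)


# Hypothetical per-metric scores, for illustration only.
example = dict(zip(METRICS, (60.0, 50.0, 70.0, 80.0, 40.0, 30.0, 55.0, 45.0)))
print(macro_average(example))
```

Because the mean is unweighted, each of the eight metrics contributes equally regardless of how many items its underlying task contains.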

Rank  Subject                 Macro Average (computed)  Model Match                           Provenance  Sampled
1     GPT-4o                  63.77                     GPT-4o (openai-gpt-4o)                Imported    2026-05-06
2     Deepseek-R1             62.31                     R1 (deepseek-r1)                      Imported    2026-05-06
3     Deepseek-V3             57.10                     DeepSeek V3 (deepseek-deepseek-chat)  Imported    2026-05-06
4     Llama-3.1-Instruct-70B  50.41                     -                                     Imported    2026-05-06
5     o1-mini                 49.38                     -                                     Imported    2026-05-06
6     GPT-4o-mini             49.33                     GPT-4o-mini (openai-gpt-4o-mini)      Imported    2026-05-06
7     Qwen2.5-Instruct-72B    48.22                     -                                     Imported    2026-05-06
8     QwQ                     34.47                     -                                     Imported    2026-05-06
9     Llama-3.1-Instruct-8B   28.73                     -                                     Imported    2026-05-06
10    Qwen2.5-Instruct-7B     25.64                     -                                     Imported    2026-05-06
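The ordering of the table can be reproduced by sorting rows on the computed macro average in descending order. The sketch below uses the values shown in this snapshot; the function name `rank_rows` and the tuple layout are hypothetical, and the tie-breaking behavior (ties keep their input order) is an assumption, since no ties occur in this snapshot.

```python
# (subject, computed macro average) pairs taken from the table above.
ROWS = [
    ("Qwen2.5-Instruct-7B", 25.64),
    ("GPT-4o-mini", 49.33),
    ("Deepseek-R1", 62.31),
    ("QwQ", 34.47),
    ("GPT-4o", 63.77),
    ("Llama-3.1-Instruct-8B", 28.73),
    ("o1-mini", 49.38),
    ("Deepseek-V3", 57.10),
    ("Qwen2.5-Instruct-72B", 48.22),
    ("Llama-3.1-Instruct-70B", 50.41),
]


def rank_rows(rows: list[tuple[str, float]]) -> list[tuple[int, str, float]]:
    """Assign 1-based ranks by descending macro average (hypothetical helper).

    sorted() is stable, so rows with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


ranked = rank_rows(ROWS)
for rank, name, score in ranked:
    print(f"{rank:>2}  {name:<24}{score:.2f}")
```

Note that ranks 5 and 6 (o1-mini and GPT-4o-mini) differ by only 0.05 points, so their relative order is sensitive to small changes in any single metric.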