HumanEval-Mul

A multilingual variant of the HumanEval benchmark, which measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
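To make "functional correctness" concrete, here is a minimal sketch of how a single docstring-to-code problem might be checked: the model's completion is appended to the prompt, executed, and run against unit tests. The problem, the names `check_candidate`, `PROMPT`, `GOOD`, `BAD`, and `TEST_CODE`, and the pass-fraction score are illustrative assumptions, not the benchmark's actual harness or data. A real harness would run untrusted code in a sandbox rather than with a bare `exec`.

```python
# Hypothetical problem in the HumanEval style: a signature plus docstring.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'

# Two hypothetical model completions (function bodies only).
GOOD = "    return a + b\n"
BAD = "    return a - b\n"

# Hypothetical unit tests, wrapped in a check(candidate) function.
TEST_CODE = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)


def check_candidate(prompt, completion, test_code, entry_point):
    """Return True if prompt + completion passes every assertion in test_code."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)       # define the candidate function
        exec(test_code, namespace)                 # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the unit tests
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


results = [check_candidate(PROMPT, c, TEST_CODE, "add") for c in (GOOD, BAD)]
# Assumed scoring: fraction of completions that pass all tests.
score = sum(results) / len(results)
```

Under this assumed scoring, a value such as 0.83 would mean that 83% of attempted problems passed their tests. The source does not state how the leaderboard's Score or Normalized Score is actually computed.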

Rows: 2
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject        Score  Model Match                           Provenance     Sampled
1     DeepSeek-V3    0.83   DeepSeek V3 (deepseek-deepseek-chat)  Self-reported  2026-05-06
2     DeepSeek-V2.5  0.74   -                                     Self-reported  2026-05-06