HumanEval-Mul
A multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings. The problem set consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
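As a rough illustration of what "functional correctness" means here, the sketch below completes a function from its docstring, runs it against unit tests, and reports the fraction of problems passed. The problem, the completions, and the helper names `passes` and `score` are hypothetical and not taken from the benchmark. The page does not state how Score is computed, so the pass fraction is an assumption.

```python
# Hypothetical HumanEval-style problem: a prompt (signature + docstring),
# a model completion, and unit tests. None of this comes from the benchmark.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''
COMPLETION = "    return a + b\n"
TESTS = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if the completed program passes every unit test."""
    namespace = {}
    try:
        # Assemble prompt + completion + tests into one program and run it.
        # Note: exec is not sandboxed; untrusted code needs real isolation.
        exec(prompt + completion + tests, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


def score(results):
    """Assumed scoring rule: fraction of problems whose completion passed."""
    return sum(results) / len(results)


results = [
    passes(PROMPT, COMPLETION, TESTS, "add"),            # correct completion
    passes(PROMPT, "    return a - b\n", TESTS, "add"),  # buggy completion
]
print(score(results))
```

A completion counts only if the assembled program runs and every assertion holds, so text similarity to a reference solution plays no role.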
Rows: 2
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | 0.83 | DeepSeek V3 deepseek-deepseek-chat | Self-reported | 2026-05-06 |
| 2 | DeepSeek-V2.5 | 0.74 | — | Self-reported | 2026-05-06 |