AutoLogi
AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate the reasoning abilities of large language models (LLMs). The benchmark addresses limitations of existing multiple-choice reasoning evaluations by using program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
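To illustrate what program-based verification of an open-ended puzzle can look like, here is a minimal sketch. It is not AutoLogi's actual code: the puzzle, the names `PEOPLE`, `verify`, and `solutions`, and the two constraints are all hypothetical, chosen only to show how a program (rather than a fixed answer key) can accept any arrangement that satisfies the rules.

```python
from itertools import permutations

# Hypothetical puzzle (not taken from AutoLogi): seat Ann, Bo, and Cy
# in positions 0..2 such that
#   (1) Ann is not in position 0, and
#   (2) Bo sits immediately to the left of Cy.
PEOPLE = ("Ann", "Bo", "Cy")


def verify(arrangement):
    """Return True if the candidate arrangement satisfies every constraint."""
    # Format check: the answer must use each person exactly once.
    if sorted(arrangement) != sorted(PEOPLE):
        return False
    pos = {name: i for i, name in enumerate(arrangement)}
    return pos["Ann"] != 0 and pos["Bo"] + 1 == pos["Cy"]


def solutions():
    """Enumerate all valid arrangements by brute force."""
    return [p for p in permutations(PEOPLE) if verify(p)]


if __name__ == "__main__":
    print(verify(("Bo", "Cy", "Ann")))   # a candidate answer from a model
    print(len(solutions()))              # how many arrangements are valid
```

In a sketch like this, adding or removing constraints changes how many arrangements pass `verify`, which is one plausible way to think about adjusting difficulty; the benchmark's own mechanism may differ.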
Rows: 2
Primary metric: Score
Sampled: 2026-05-06
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2 Instruct | 0.90 | MoonshotAI: Kimi K2 0711 (`moonshotai-kimi-k2`) | Self-reported | 2026-05-06 |
| 1 | Kimi K2-Instruct-0905 | 0.90 | MoonshotAI: Kimi K2 0905 (`moonshotai-kimi-k2-0905`) | Self-reported | 2026-05-06 |