AutoLogi

AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate the reasoning abilities of large language models (LLMs). The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities in both languages.
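To illustrate what program-based verification of an open-ended answer can look like, here is a minimal sketch. The puzzle, the constraint set, and the function names (`check_format`, `check_constraints`, `verify`, `count_solutions`) are all hypothetical; they are not taken from the AutoLogi dataset or its code.

```python
from itertools import permutations

# Hypothetical puzzle (an assumption, not an AutoLogi item):
# seat Alice, Bob, and Carol in three seats, left to right, so that
#   1. Alice sits somewhere to the left of Bob, and
#   2. Carol does not sit in the middle seat.
PEOPLE = ("Alice", "Bob", "Carol")


def check_format(answer):
    """Return True if the answer is a list naming each person exactly once."""
    return (
        isinstance(answer, list)
        and all(isinstance(name, str) for name in answer)
        and sorted(answer) == sorted(PEOPLE)
    )


def check_constraints(answer):
    """Return True if a well-formed seating satisfies both puzzle constraints."""
    alice_left_of_bob = answer.index("Alice") < answer.index("Bob")
    carol_not_in_middle = answer[1] != "Carol"
    return alice_left_of_bob and carol_not_in_middle


def verify(answer):
    """Accept any answer that is well-formed and satisfies every constraint."""
    return check_format(answer) and check_constraints(answer)


def count_solutions():
    """Brute-force the number of valid seatings for this small puzzle."""
    return sum(verify(list(p)) for p in permutations(PEOPLE))
```

In this sketch more than one seating passes `verify`, which is the point of checking answers with a program: any valid arrangement is accepted, and there is no fixed option list for a model to guess from.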

Rows: 2

Primary metric: Score

Sampled: 2026-05-06

Metadata

ID: autologi
Category: Reasoning
Release: 2025-02-24
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Kimi K2 Instruct	0.90	MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2)	Self-reported	2026-05-06
1	Kimi K2-Instruct-0905	0.90	MoonshotAI: Kimi K2 0905 (moonshotai-kimi-k2-0905)	Self-reported	2026-05-06
