LLM Game Benchmark

Grid-based game competition benchmark evaluating LLM strategic play and invalid-move behavior in Tic-Tac-Toe, Connect Four, and Gomoku.

7 rows
source_win_rate primary metric
2026-05-06 sampled

Metadata

Metrics

Source Win Rate, Wins, Losses (lower is better), Disqualifications (lower is better), Draws (lower is better), Invalid Move Rate (lower is better), Total Moves (lower is better), Game Records

Latest Results

Rows are aggregated from raw public leaderboard game records across first- and second-player appearances. Disqualifications are preserved separately and not converted into guessed wins.
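The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the `GameRecord` fields, the outcome labels, and the `aggregate_records` and `win_rate` helpers are all hypothetical names, and the win-rate formula (wins divided by all recorded games, disqualified games included) is an assumption, since the source does not define it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GameRecord:
    # Hypothetical schema; the real leaderboard records may be shaped differently.
    first_player: str
    second_player: str
    # Assumed labels: "first_win", "second_win", "draw",
    # "first_disqualified", "second_disqualified"
    outcome: str


def _empty_row():
    return {"games": 0, "wins": 0, "losses": 0, "draws": 0, "disqualifications": 0}


def aggregate_records(records):
    """Aggregate per-model totals across first- and second-player appearances."""
    rows = defaultdict(_empty_row)
    for rec in records:
        first = rows[rec.first_player]
        second = rows[rec.second_player]
        first["games"] += 1
        second["games"] += 1
        if rec.outcome == "first_win":
            first["wins"] += 1
            second["losses"] += 1
        elif rec.outcome == "second_win":
            second["wins"] += 1
            first["losses"] += 1
        elif rec.outcome == "draw":
            first["draws"] += 1
            second["draws"] += 1
        elif rec.outcome == "first_disqualified":
            # Counted separately; the opponent is NOT credited with a win.
            first["disqualifications"] += 1
        elif rec.outcome == "second_disqualified":
            second["disqualifications"] += 1
        else:
            raise ValueError(f"unknown outcome: {rec.outcome!r}")
    return dict(rows)


def win_rate(row):
    # Assumed definition: wins over all recorded games for that model.
    return row["wins"] / row["games"] if row["games"] else 0.0
```

Keeping disqualifications in their own counter means a model's win total reflects only games it actually won, which is one plausible reading of "not converted into guessed wins".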

Rank | Subject | Source Win Rate | Model Match | Provenance | Sampled
1 | anthropic.claude-3-5-sonnet-20240620-v1:0 | 0.42 | | Imported | 2026-05-06
2 | meta.llama3-70b-instruct-v1:0 | 0.42 | | Imported | 2026-05-06
3 | gpt-4o | 0.40 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06
4 | gemini-1.5-pro | 0.35 | | Imported | 2026-05-06
5 | gpt-4-turbo | 0.35 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-06
6 | gemini-1.5-flash | 0.33 | | Imported | 2026-05-06
7 | anthropic.claude-3-sonnet-20240229-v1:0 | 0.32 | | Imported | 2026-05-06