LLM Game Benchmark
Grid-based game competition benchmark evaluating LLM strategic play and invalid-move behavior in Tic-Tac-Toe, Connect Four, and Gomoku.
Rows: 7
Primary metric: source_win_rate
Sampled: 2026-05-06
Metadata
Metrics
Source Win Rate, Wins, Losses (lower is better), Disqualifications (lower is better), Draws (lower is better), Invalid Move Rate (lower is better), Total Moves (lower is better), Game Records
| Rank | Subject | Source Win Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | anthropic.claude-3-5-sonnet-20240620-v1:0 | 0.42 | — | Imported | 2026-05-06 |
| 2 | meta.llama3-70b-instruct-v1:0 | 0.42 | — | Imported | 2026-05-06 |
| 3 | gpt-4o | 0.40 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06 |
| 4 | gemini-1.5-pro | 0.35 | — | Imported | 2026-05-06 |
| 5 | gpt-4-turbo | 0.35 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-06 |
| 6 | gemini-1.5-flash | 0.33 | — | Imported | 2026-05-06 |
| 7 | anthropic.claude-3-sonnet-20240229-v1:0 | 0.32 | — | Imported | 2026-05-06 |
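The table assigns consecutive ranks even where Source Win Rate is tied (0.42 for ranks 1 and 2, 0.35 for ranks 4 and 5). The minimal sketch below re-ranks the rows so that tied scores share a rank. The scores are copied from the table. The function name `competition_ranks` and the tie-handling rule are assumptions made for illustration, not something the benchmark specifies.

```python
# Rows copied from the leaderboard table: (subject, source_win_rate).
ROWS = [
    ("anthropic.claude-3-5-sonnet-20240620-v1:0", 0.42),
    ("meta.llama3-70b-instruct-v1:0", 0.42),
    ("gpt-4o", 0.40),
    ("gemini-1.5-pro", 0.35),
    ("gpt-4-turbo", 0.35),
    ("gemini-1.5-flash", 0.33),
    ("anthropic.claude-3-sonnet-20240229-v1:0", 0.32),
]


def competition_ranks(rows):
    """Return (rank, subject, score) tuples; tied scores share a rank.

    Hypothetical helper: uses "1, 1, 3" style competition ranking,
    which is an assumption and not the benchmark's stated rule.
    """
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: share its rank
        else:
            rank = position
        ranked.append((rank, subject, score))
    return ranked


if __name__ == "__main__":
    for rank, subject, score in competition_ranks(ROWS):
        print(f"{rank}\t{score:.2f}\t{subject}")
```

Under this rule the two 0.42 rows would both be rank 1 and the two 0.35 rows would both be rank 4. The other metrics (for example Invalid Move Rate) could serve as tie-breakers, but the source does not say whether they are used that way.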