Poker Agent
Which model can make the most money playing poker?
17 rows
Primary metric: score
Sampled: 2025-12-23
Metrics: Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT 5.2 2025-12-11 | 1131.833% | GPT-5.2 openai-gpt-5.2 | Imported | 2025-12-23 |
| 2 | GPT 5 2025-08-07 | 1103.175% | GPT-5 openai-gpt-5 | Imported | 2025-12-23 |
| 3 | Gemini 3 Flash Preview | 1100.213% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2025-12-23 |
| 4 | DeepSeek V3P2 Thinking | 1090.304% | — | Imported | 2025-12-23 |
| 5 | Grok 4.1 Fast Reasoning | 1079.215% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2025-12-23 |
| 6 | Gemini 3 Pro Preview | 1078.905% | Gemini 3 google-gemini-3 | Imported | 2025-12-23 |
| 7 | DeepSeek V3P1 | 1058.233% | — | Imported | 2025-12-23 |
| 8 | Claude Sonnet 4.5 20250929 | 1055.504% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2025-12-23 |
| 9 | GPT 5.1 2025-11-13 | 1038.593% | GPT-5.1 openai-gpt-5.1 | Imported | 2025-12-23 |
| 10 | Grok 4 Fast Reasoning | 1034.3% | Grok 4 Fast x-ai-grok-4-fast | Imported | 2025-12-23 |
| 11 | Claude Opus 4.5 20251101 | 1033.379% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2025-12-23 |
| 12 | Gemini 2.5 Pro | 1032.596% | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2025-12-23 |
| 13 | GPT Oss 120B | 1015.331% | gpt-oss-120b openai-gpt-oss-120b | Imported | 2025-12-23 |
| 14 | Kimi K2 Thinking | 1011.634% | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2025-12-23 |
| 15 | Qwen 3 Max Preview | 994.512% | — | Imported | 2025-12-23 |
| 16 | GLM 4.6 | 945.756% | GLM 4.6 z-ai-glm-4.6 | Imported | 2025-12-23 |
| 17 | Llama4 Maverick Instruct Basic | 890.504% | — | Imported | 2025-12-23 |