IneqMath
Olympiad-level inequality proof benchmark evaluating both final-answer correctness and step-wise reasoning soundness for chat and reasoning LLMs.
Rows: 55
Primary metric: overall_accuracy
Sampled: 2026-05-06
Metadata
Metrics: Overall Accuracy, Answer Accuracy, Step Accuracy (NTC), Step Accuracy (NLG), Step Accuracy (NAE), Step Accuracy (NCE)
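
As a rough illustration of how the six metrics could relate to one another, the sketch below aggregates per-problem judge verdicts into percentages. It rests on assumptions that this page does not state: that NTC, NLG, NAE and NCE stand for "no toy case", "no logical gap", "no approximation error" and "no calculation error", and that a solution counts toward Overall Accuracy only when its final answer is correct and it passes all four step checks. The `JudgedSolution` class and the `summarize` helper are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class JudgedSolution:
    """Hypothetical per-problem verdicts; field meanings are assumptions."""
    answer_correct: bool          # final answer matches the reference
    no_toy_case: bool             # assumed meaning of NTC
    no_logical_gap: bool          # assumed meaning of NLG
    no_approximation_error: bool  # assumed meaning of NAE
    no_calculation_error: bool    # assumed meaning of NCE

    def overall_correct(self) -> bool:
        # Assumption: overall credit requires a correct answer AND
        # a pass on every step-wise check.
        return all((
            self.answer_correct,
            self.no_toy_case,
            self.no_logical_gap,
            self.no_approximation_error,
            self.no_calculation_error,
        ))


def accuracy(flags: Iterable[bool]) -> float:
    """Percentage of True values; 0.0 for an empty input."""
    values = list(flags)
    return 100.0 * sum(values) / len(values) if values else 0.0


def summarize(solutions: List[JudgedSolution]) -> Dict[str, float]:
    """Aggregate per-problem verdicts into the six leaderboard-style metrics."""
    return {
        "overall_accuracy": accuracy(s.overall_correct() for s in solutions),
        "answer_accuracy": accuracy(s.answer_correct for s in solutions),
        "step_accuracy_ntc": accuracy(s.no_toy_case for s in solutions),
        "step_accuracy_nlg": accuracy(s.no_logical_gap for s in solutions),
        "step_accuracy_nae": accuracy(s.no_approximation_error for s in solutions),
        "step_accuracy_nce": accuracy(s.no_calculation_error for s in solutions),
    }
```

Under this assumed rule, Overall Accuracy can never exceed Answer Accuracy or any single Step Accuracy, which would be consistent with a model reaching the right answer through an unsound proof and still scoring low overall.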
| Rank | Subject | Overall Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 (medium, 30K) | 47 | GPT-5 openai-gpt-5 | Imported | 2026-05-06 |
| 2 | o3-pro (medium, 40K) | 46 | o3 Pro openai-o3-pro | Imported | 2026-05-06 |
| 3 | Gemini 2.5 Pro Preview (40K) | 46 | Gemini 2.5 Pro Preview 06-05 google-gemini-2.5-pro-preview | Imported | 2026-05-06 |
| 4 | o3-pro (medium, 10K) | 45.50 | o3 Pro openai-o3-pro | Imported | 2026-05-06 |
| 5 | Gemini 2.5 Pro (30K) | 43.50 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-06 |
| 6 | o3 (medium, 40K) | 37 | o3 openai-o3 | Imported | 2026-05-06 |
| 7 | GPT-5 mini (medium, 10K) | 30.50 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-06 |
| 8 | GPT-5 (medium, 10K) | 28 | GPT-5 openai-gpt-5 | Imported | 2026-05-06 |
| 9 | Gemini 2.5 Flash Preview 05-20 (40K) | 27.50 | — | Imported | 2026-05-06 |
| 10 | gpt-oss-120b (10K) | 23.50 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-06 |
| 11 | Gemini 2.5 Flash (40K) | 23.50 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-06 |
| 12 | o3 (medium, 10K) | 21 | o3 openai-o3 | Imported | 2026-05-06 |
| 13 | DeepSeek-V3.1 (Thinking Mode) (30K) | 15.50 | DeepSeek V3.1 deepseek-deepseek-chat-v3.1 | Imported | 2026-05-06 |
| 14 | o4-mini (medium, 10K) | 15.50 | o4 Mini openai-o4-mini | Imported | 2026-05-06 |
| 15 | Gemini 2.5 Flash Preview 05-20 (10K) | 14.50 | — | Imported | 2026-05-06 |
| 16 | DeepSeek-V3.1 (Thinking Mode) (10K) | 12 | DeepSeek V3.1 deepseek-deepseek-chat-v3.1 | Imported | 2026-05-06 |
| 17 | Gemini 2.5 Pro Preview (10K) | 10 | Gemini 2.5 Pro Preview 06-05 google-gemini-2.5-pro-preview | Imported | 2026-05-06 |
| 18 | DeepSeek-R1-0528 (40K) | 9.50 | R1 0528 deepseek-deepseek-r1-0528 | Imported | 2026-05-06 |
| 19 | o3-mini (medium, 10K) | 9.50 | o3-mini openai-o3-mini | Imported | 2026-05-06 |
| 20 | Kimi K2 Instruct | 9 | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-05-06 |
| 21 | Grok 4 (40K) | 8 | Grok 4 x-ai-grok-4 | Imported | 2026-05-06 |
| 22 | o1 (medium, 10K) | 8 | o1 openai-o1 | Imported | 2026-05-06 |
| 23 | o1 (medium, 40K) | 7.50 | o1 openai-o1 | Imported | 2026-05-06 |
| 24 | DeepSeek-V3-0324 | 7 | DeepSeek V3 0324 deepseek-deepseek-chat-v3-0324 | Imported | 2026-05-06 |
| 25 | Grok 3 mini (medium, 10K) | 6 | Grok 3 Mini x-ai-grok-3-mini | Imported | 2026-05-06 |
| 26 | Qwen3-235B-A22B (10K) | 6 | Qwen3 235B A22B qwen-qwen3-235b-a22b | Imported | 2026-05-06 |
| 27 | Gemini 2.5 Pro (10K) | 6 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-06 |
| 28 | Claude Opus 4 (10K) | 5.50 | Claude Opus 4 anthropic-claude-opus-4 | Imported | 2026-05-06 |
| 29 | Qwen3-4B | 5.50 | — | Imported | 2026-05-06 |
| 30 | DeepSeek-R1 (10K) | 5 | R1 deepseek-r1 | Imported | 2026-05-06 |
| 31 | DeepSeek-R1 (Qwen-14B) (10K) | 5 | R1 deepseek-r1 | Imported | 2026-05-06 |
| 32 | DeepSeek-R1-0528 (10K) | 4.50 | R1 0528 deepseek-deepseek-r1-0528 | Imported | 2026-05-06 |
| 33 | Gemini 2.5 Flash (10K) | 4.50 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-06 |
| 34 | Grok 3 | 3.50 | Grok 3 x-ai-grok-3 | Imported | 2026-05-06 |
| 35 | DeepSeek-R1 (Llama-70B) (10K) | 3.50 | R1 deepseek-r1 | Imported | 2026-05-06 |
| 36 | Gemini 2.0 Flash | 3 | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-05-06 |
| 37 | Claude Sonnet 4 (10K) | 3 | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-05-06 |
| 38 | GPT-4o | 3 | GPT-4o openai-gpt-4o | Imported | 2026-05-06 |
| 39 | Qwen2.5-7B | 3 | — | Imported | 2026-05-06 |
| 40 | Qwen2.5-72B | 2.50 | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Imported | 2026-05-06 |
| 41 | GPT-4.1 | 2.50 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-06 |
| 42 | Llama-4-Maverick | 2.50 | Llama 4 Maverick meta-llama-4-maverick | Imported | 2026-05-06 |
| 43 | QwQ-32B (10K) | 2 | — | Imported | 2026-05-06 |
| 44 | QwQ-32B-preview (10K) | 2 | — | Imported | 2026-05-06 |
| 45 | Claude 3.7 Sonnet (10K) | 2 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-06 |
| 46 | GPT-4o mini | 2 | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-06 |
| 47 | Qwen2.5-Coder-32B | 1.50 | — | Imported | 2026-05-06 |
| 48 | Llama-4-Scout | 1.50 | Llama 4 Scout meta-llama-llama-4-scout | Imported | 2026-05-06 |
| 49 | Gemini 2.0 Flash-Lite | 1.50 | Gemini 2.0 Flash Lite google-gemini-2.0-flash-lite-001 | Imported | 2026-05-06 |
| 50 | Claude 3.7 Sonnet (8K) | 1 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-06 |
| 51 | DeepSeek-R1 (Qwen-1.5B) (10K) | 0.50 | R1 deepseek-r1 | Imported | 2026-05-06 |
| 52 | Gemma-2-9B (6K) | 0 | — | Imported | 2026-05-06 |
| 53 | Llama-3.1-8B | 0 | — | Imported | 2026-05-06 |
| 54 | Llama-3.2-3B | 0 | — | Imported | 2026-05-06 |
| 55 | Gemma-2B (6K) | 0 | — | Imported | 2026-05-06 |