IneqMath

Olympiad-level inequality proof benchmark evaluating both final-answer correctness and step-wise reasoning soundness for chat and reasoning LLMs.

Rows: 55
Primary metric: overall_accuracy
Sampled: 2026-05-06

Metadata

Metrics

Overall Accuracy, Answer Accuracy, Step Accuracy (NTC), Step Accuracy (NLG), Step Accuracy (NAE), Step Accuracy (NCE)
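
As a rough sketch of how these metrics could relate, the snippet below assumes that a response counts toward Overall Accuracy only when its final answer is correct and it also passes all four step-wise checks (NTC, NLG, NAE, NCE). That aggregation rule, the `Judgement` record, and its field names are assumptions made for illustration; this page does not define them.

```python
from dataclasses import dataclass

# Step-wise checks named on this page; the lowercase field names are hypothetical.
STEP_CHECKS = ("ntc", "nlg", "nae", "nce")


@dataclass(frozen=True)
class Judgement:
    """Hypothetical per-problem verdicts for one model response."""
    answer_correct: bool
    ntc: bool
    nlg: bool
    nae: bool
    nce: bool


def overall_correct(j: Judgement) -> bool:
    # Assumed rule: correct final answer AND every step-wise check passed.
    return j.answer_correct and all(getattr(j, c) for c in STEP_CHECKS)


def accuracy_report(judgements: list[Judgement]) -> dict[str, float]:
    """Return each metric as a percentage of all judged problems."""
    n = len(judgements)
    if n == 0:
        raise ValueError("need at least one judgement")

    def pct(count: int) -> float:
        return round(100.0 * count / n, 2)

    report = {
        "overall_accuracy": pct(sum(overall_correct(j) for j in judgements)),
        "answer_accuracy": pct(sum(j.answer_correct for j in judgements)),
    }
    for check in STEP_CHECKS:
        report[f"step_accuracy_{check}"] = pct(
            sum(getattr(j, check) for j in judgements)
        )
    return report
```

Under this assumed rule, Overall Accuracy can never exceed Answer Accuracy or any single Step Accuracy, which would explain why it is the strictest number in the table.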

Latest Results

Rows are parsed from the IneqMath README leaderboard table for the test set. Source display names are preserved; icons are normalized only in metadata.
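
One way such rows could be extracted is sketched below, assuming the README leaderboard is a pipe-delimited Markdown table. The function name `parse_markdown_table` and the column names in the usage example are hypothetical; the actual README layout and the importer used for this page are not shown here.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited table in `text` into a list of dicts.

    Cell text is kept verbatim (only surrounding whitespace is trimmed),
    so source display names are preserved.
    """
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # first table has ended
            continue
        # Drop the outer pipes, then split the remaining cells.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [c.strip() for c in inner.split("|")]
        if header is None:
            header = cells
            continue
        # Skip the separator row, e.g. |:---|---:|
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        if len(cells) != len(header):
            continue  # malformed row; ignore rather than guess
        rows.append(dict(zip(header, cells)))
    return rows
```

A short usage example with made-up column names:

```python
sample = """
| Rank | Model | Overall Acc |
|:----:|:------|------------:|
| 1 | GPT-5 (medium, 30K) | 47 |
"""
rows = parse_markdown_table(sample)
print(rows[0]["Model"])
```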

| Rank | Subject | Overall Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 (medium, 30K) | 47 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06 |
| 2 | o3-pro (medium, 40K) | 46 | o3 Pro (openai-o3-pro) | Imported | 2026-05-06 |
| 3 | Gemini 2.5 Pro Preview (40K) | 46 | Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview) | Imported | 2026-05-06 |
| 4 | o3-pro (medium, 10K) | 45.50 | o3 Pro (openai-o3-pro) | Imported | 2026-05-06 |
| 5 | Gemini 2.5 Pro (30K) | 43.50 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 6 | o3 (medium, 40K) | 37 | o3 (openai-o3) | Imported | 2026-05-06 |
| 7 | GPT-5 mini (medium, 10K) | 30.50 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-06 |
| 8 | GPT-5 (medium, 10K) | 28 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06 |
| 9 | Gemini 2.5 Flash Preview 05-20 (40K) | 27.50 | | Imported | 2026-05-06 |
| 10 | gpt-oss-120b (10K) | 23.50 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-06 |
| 11 | Gemini 2.5 Flash (40K) | 23.50 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 12 | o3 (medium, 10K) | 21 | o3 (openai-o3) | Imported | 2026-05-06 |
| 13 | DeepSeek-V3.1 (Thinking Mode) (30K) | 15.50 | DeepSeek V3.1 (deepseek-deepseek-chat-v3.1) | Imported | 2026-05-06 |
| 14 | o4-mini (medium, 10K) | 15.50 | o4 Mini (openai-o4-mini) | Imported | 2026-05-06 |
| 15 | Gemini 2.5 Flash Preview 05-20 (10K) | 14.50 | | Imported | 2026-05-06 |
| 16 | DeepSeek-V3.1 (Thinking Mode) (10K) | 12 | DeepSeek V3.1 (deepseek-deepseek-chat-v3.1) | Imported | 2026-05-06 |
| 17 | Gemini 2.5 Pro Preview (10K) | 10 | Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview) | Imported | 2026-05-06 |
| 18 | DeepSeek-R1-0528 (40K) | 9.50 | R1 0528 (deepseek-deepseek-r1-0528) | Imported | 2026-05-06 |
| 19 | o3-mini (medium, 10K) | 9.50 | o3-mini (openai-o3-mini) | Imported | 2026-05-06 |
| 20 | Kimi K2 Instruct | 9 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-06 |
| 21 | Grok 4 (40K) | 8 | Grok 4 (x-ai-grok-4) | Imported | 2026-05-06 |
| 22 | o1 (medium, 10K) | 8 | o1 (openai-o1) | Imported | 2026-05-06 |
| 23 | o1 (medium, 40K) | 7.50 | o1 (openai-o1) | Imported | 2026-05-06 |
| 24 | DeepSeek-V3-0324 | 7 | DeepSeek V3 0324 (deepseek-deepseek-chat-v3-0324) | Imported | 2026-05-06 |
| 25 | Grok 3 mini (medium, 10K) | 6 | Grok 3 Mini (x-ai-grok-3-mini) | Imported | 2026-05-06 |
| 26 | Qwen3-235B-A22B (10K) | 6 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-06 |
| 27 | Gemini 2.5 Pro (10K) | 6 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 28 | Claude Opus 4 (10K) | 5.50 | Claude Opus 4 (anthropic-claude-opus-4) | Imported | 2026-05-06 |
| 29 | Qwen3-4B | 5.50 | | Imported | 2026-05-06 |
| 30 | DeepSeek-R1 (10K) | 5 | R1 (deepseek-r1) | Imported | 2026-05-06 |
| 31 | DeepSeek-R1 (Qwen-14B) (10K) | 5 | R1 (deepseek-r1) | Imported | 2026-05-06 |
| 32 | DeepSeek-R1-0528 (10K) | 4.50 | R1 0528 (deepseek-deepseek-r1-0528) | Imported | 2026-05-06 |
| 33 | Gemini 2.5 Flash (10K) | 4.50 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 34 | Grok 3 | 3.50 | Grok 3 (xaigrok-3) | Imported | 2026-05-06 |
| 35 | DeepSeek-R1 (Llama-70B) (10K) | 3.50 | R1 (deepseek-r1) | Imported | 2026-05-06 |
| 36 | Gemini 2.0 Flash | 3 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-06 |
| 37 | Claude Sonnet 4 (10K) | 3 | Claude Sonnet 4 (anthropic-claude-sonnet-4) | Imported | 2026-05-06 |
| 38 | GPT-4o | 3 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06 |
| 39 | Qwen2.5-7B | 3 | | Imported | 2026-05-06 |
| 40 | Qwen2.5-72B | 2.50 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06 |
| 41 | GPT-4.1 | 2.50 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06 |
| 42 | Llama-4-Maverick | 2.50 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-06 |
| 43 | QwQ-32B (10K) | 2 | | Imported | 2026-05-06 |
| 44 | QwQ-32B-preview (10K) | 2 | | Imported | 2026-05-06 |
| 45 | Claude 3.7 Sonnet (10K) | 2 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06 |
| 46 | GPT-4o mini | 2 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-06 |
| 47 | Qwen2.5-Coder-32B | 1.50 | | Imported | 2026-05-06 |
| 48 | Llama-4-Scout | 1.50 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-06 |
| 49 | Gemini 2.0 Flash-Lite | 1.50 | Gemini 2.0 Flash Lite (google-gemini-2.0-flash-lite-001) | Imported | 2026-05-06 |
| 50 | Claude 3.7 Sonnet (8K) | 1 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06 |
| 51 | DeepSeek-R1 (Qwen-1.5B) (10K) | 0.50 | R1 (deepseek-r1) | Imported | 2026-05-06 |
| 52 | Gemma-2-9B (6K) | 0 | | Imported | 2026-05-06 |
| 53 | Llama-3.1-8B | 0 | | Imported | 2026-05-06 |
| 54 | Llama-3.2-3B | 0 | | Imported | 2026-05-06 |
| 55 | Gemma-2B (6K) | 0 | | Imported | 2026-05-06 |