BigCodeBench-Hard
BigCodeBench-Hard evaluates code generation on the harder BigCodeBench subset, reporting pass@1 in complete and instruct settings.
Rows: 25
Primary metric: instruct_pass_at_1
Sampled: 2026-05-05
Metadata
Metrics: Instruct pass@1, Complete pass@1
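For readers unfamiliar with the metric, the following is a minimal sketch of how a pass@1 score is commonly computed. It uses the standard unbiased pass@k estimator and assumes the scores in the table are percentages averaged over tasks. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, and this page does not state how many samples per task were drawn, so it is not necessarily the exact procedure behind these numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated for the task
    c: samples that passed all tests
    k: attempt budget being estimated
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one pass.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-task pass@k, reported as a percentage.

    results: one (n, c) pair per task (illustrative input format).
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# With a single sample per task, pass@1 is simply the share of tasks solved.
toy_results = [(1, 1), (1, 0), (1, 0), (1, 1)]
print(benchmark_pass_at_k(toy_results, k=1))
```

With one sample per task the estimator reduces to the plain fraction of tasks whose generated solution passes every test, which is why pass@1 is often described as a single-attempt success rate.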
| Rank | Subject | Instruct pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3-mini-2025-01-31 (temperature=1, reasoning=medium) | 33.10 | o3-mini openai-o3-mini | Imported | 2026-05-05 |
| 2 | o1-2024-12-17 (temperature=1, reasoning=high) | 32.40 | o1 openai-o1 | Imported | 2026-05-05 |
| 3 | o3-mini-2025-01-31 (temperature=1, reasoning=high) | 32.40 | o3-mini openai-o3-mini | Imported | 2026-05-05 |
| 4 | Claude-3.7-Sonnet-20250219 (temperature=1, length=12800, reasoning=3200) | 32.40 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-05 |
| 5 | Claude-3.7-Sonnet-20250219 | 31.80 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-05 |
| 6 | Quasar-Alpha | 31.80 | — | Imported | 2026-05-05 |
| 7 | GPT-4.1-2025-04-14 | 31.80 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-05 |
| 8 | GPT-4.1-Mini-2025-04-14 | 31.80 | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-05 |
| 9 | o3-mini-2025-01-31 (temperature=1, reasoning=low) | 31.10 | o3-mini openai-o3-mini | Imported | 2026-05-05 |
| 10 | Grok-3-Mini-Beta (temperature=1, reasoning=low) | 31.10 | Grok 3 Mini Beta x-ai-grok-3-mini-beta | Imported | 2026-05-05 |
| 11 | Optimus-Alpha | 30.40 | — | Imported | 2026-05-05 |
| 12 | Athene-V2-Agent | 29.70 | — | Imported | 2026-05-05 |
| 13 | o1-2024-12-17 (temperature=1, reasoning=low) | 29.70 | o1 openai-o1 | Imported | 2026-05-05 |
| 14 | DeepSeek-R1 | 29.70 | R1 deepseek-r1 | Imported | 2026-05-05 |
| 15 | QwQ-32B (w/ Reasoning) | 29.70 | — | Imported | 2026-05-05 |
| 16 | Gemini-2.5-Pro-Exp-03-25 | 29.70 | — | Imported | 2026-05-05 |
| 17 | GPT-4-Turbo-2024-04-09 | 29.10 | GPT-4 Turbo openai-gpt-4-turbo | Imported | 2026-05-05 |
| 18 | Qwen2.5-Max | 29.10 | — | Imported | 2026-05-05 |
| 19 | Llama-3.3-70B-Instruct | 28.40 | Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct | Imported | 2026-05-05 |
| 20 | o1-2024-12-17 (temperature=1, reasoning=medium) | 28.40 | o1 openai-o1 | Imported | 2026-05-05 |
| 21 | DeepSeek-V3 | 28.40 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-05 |
| 22 | GPT-4.1-Nano-2025-04-14 | 28.40 | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-05 |
| 23 | o1-Mini-2024-09-12 (temperature=1) | 27.70 | — | Imported | 2026-05-05 |
| 24 | Qwen2.5-Coder-32B-Instruct | 27.70 | Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct | Imported | 2026-05-05 |
| 25 | GPT-4o-2024-11-20 | 27.70 | GPT-4o (2024-11-20) openai-gpt-4o-2024-11-20 | Imported | 2026-05-05 |