NPHardEval
Dynamic reasoning benchmark grounded in computational complexity classes, evaluating LLMs on P, NP-complete, and NP-hard algorithmic tasks with weighted accuracy.
Rows: 12
Primary metric: average_weighted_accuracy
Sampled: 2026-05-06
Metadata
Metrics
Average Weighted Accuracy, P Weighted Accuracy, NP-complete Weighted Accuracy, NP-hard Weighted Accuracy, SAS Weighted Accuracy, EDP Weighted Accuracy, SPP Weighted Accuracy, GCP_D Weighted Accuracy, KSP Weighted Accuracy, TSP_D Weighted Accuracy, GCP Weighted Accuracy, MSP Weighted Accuracy, TSP Weighted Accuracy
| Rank | Subject | Average Weighted Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4-Turbo | 0.38 | GPT-4 Turbo (`openai-gpt-4-turbo`) | Imported | 2026-05-06 |
| 2 | Claude-2 | 0.26 | — | Imported | 2026-05-06 |
| 3 | GPT-3.5-Turbo | 0.26 | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-06 |
| 4 | Claude-Instant | 0.24 | — | Imported | 2026-05-06 |
| 5 | Qwen/Qwen-14B-Chat | 0.22 | — | Imported | 2026-05-06 |
| 6 | 01-ai/Yi-34B-Chat | 0.19 | — | Imported | 2026-05-06 |
| 7 | mistralai/Mistral-7B-Instruct-v0.1 | 0.18 | Mistral: Mistral 7B Instruct v0.1 (`mistralai-mistral-7b-instruct-v0.1`) | Imported | 2026-05-06 |
| 8 | PaLM-2 | 0.16 | — | Imported | 2026-05-06 |
| 9 | microsoft/phi-2 | 0.09 | — | Imported | 2026-05-06 |
| 10 | lmsys/vicuna-13b-v1.3 | 0.08 | — | Imported | 2026-05-06 |
| 11 | microsoft/phi-1_5 | 0.00 | — | Imported | 2026-05-06 |
| 12 | mosaicml/mpt-30b-instruct | 0.00 | — | Imported | 2026-05-06 |
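
This page does not say how the weighted accuracy scores above are computed. As a rough illustration only, the sketch below assumes that each difficulty level of a task has its own accuracy and that harder levels count more, with level `i` given weight `i`. The function name `weighted_accuracy` and the default weights are hypothetical and may differ from what the benchmark actually uses.

```python
from typing import Optional, Sequence


def weighted_accuracy(
    accuracies: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted mean of per-difficulty-level accuracies.

    accuracies[i] is the accuracy at difficulty level i + 1.
    ASSUMPTION: if no weights are given, level i gets weight i
    (1, 2, ..., n), so harder levels contribute more.
    """
    if not accuracies:
        raise ValueError("at least one accuracy is required")
    if weights is None:
        weights = range(1, len(accuracies) + 1)
    if len(weights) != len(accuracies):
        raise ValueError("weights and accuracies must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # Sum of weight * accuracy, normalised by the total weight.
    return sum(w * a for w, a in zip(weights, accuracies)) / total_weight


if __name__ == "__main__":
    # Hypothetical per-level accuracies for one task, easiest level first.
    example = [0.9, 0.7, 0.4, 0.1]
    print(round(weighted_accuracy(example), 3))
```

Under this assumed scheme, a model that only solves the easiest instances scores well below its unweighted accuracy, which would be consistent with the low values in the table.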