NPHardEval

Dynamic reasoning benchmark grounded in computational complexity classes, evaluating LLMs on P, NP-complete, and NP-hard algorithmic tasks with weighted accuracy.

Rows: 12
Primary metric: average_weighted_accuracy
Sampled: 2026-05-06

Metadata

Metrics

Average Weighted Accuracy, P Weighted Accuracy, NP-complete Weighted Accuracy, NP-hard Weighted Accuracy, SAS Weighted Accuracy, EDP Weighted Accuracy, SPP Weighted Accuracy, GCP_D Weighted Accuracy, KSP Weighted Accuracy, TSP_D Weighted Accuracy, GCP Weighted Accuracy, MSP Weighted Accuracy, TSP Weighted Accuracy

Latest Results

Rows are imported from the NPHardEval-results Hugging Face dataset. P, NP-complete, and NP-hard aggregates are computed from the task groupings used by the project README leaderboard.

| Rank | Subject | Average Weighted Accuracy | Model Match | Provenance | Sampled |
|------|---------|---------------------------|-------------|------------|---------|
| 1 | GPT-4-Turbo | 0.38 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-06 |
| 2 | Claude-2 | 0.26 | | Imported | 2026-05-06 |
| 3 | GPT-3.5-Turbo | 0.26 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-06 |
| 4 | Claude-Instant | 0.24 | | Imported | 2026-05-06 |
| 5 | Qwen/Qwen-14B-Chat | 0.22 | | Imported | 2026-05-06 |
| 6 | 01-ai/Yi-34B-Chat | 0.19 | | Imported | 2026-05-06 |
| 7 | mistralai/Mistral-7B-Instruct-v0.1 | 0.18 | Mistral: Mistral 7B Instruct v0.1 (mistralai-mistral-7b-instruct-v0.1) | Imported | 2026-05-06 |
| 8 | PaLM-2 | 0.16 | | Imported | 2026-05-06 |
| 9 | microsoft/phi-2 | 0.09 | | Imported | 2026-05-06 |
| 10 | lmsys/vicuna-13b-v1.3 | 0.08 | | Imported | 2026-05-06 |
| 11 | microsoft/phi-1_5 | 0.00 | | Imported | 2026-05-06 |
| 12 | mosaicml/mpt-30b-instruct | 0.00 | | Imported | 2026-05-06 |
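
The P, NP-complete, and NP-hard aggregates mentioned above can be sketched as follows. This is a minimal illustration, not the importer's actual code. It assumes the nine task metrics split into three groups in the order they are listed under Metrics (SAS, EDP, SPP as P; GCP_D, KSP, TSP_D as NP-complete; GCP, MSP, TSP as NP-hard), and that each aggregate is an unweighted mean. The names `TASK_GROUPS` and `class_aggregates`, and the sample scores, are hypothetical.

```python
from statistics import mean

# Assumed task grouping, following the order of the task metrics listed
# under "Metrics". The real grouping lives in the project README leaderboard.
TASK_GROUPS = {
    "P": ["SAS", "EDP", "SPP"],
    "NP-complete": ["GCP_D", "KSP", "TSP_D"],
    "NP-hard": ["GCP", "MSP", "TSP"],
}


def class_aggregates(task_scores):
    """Return per-class and overall averages of per-task weighted accuracy.

    Assumes an unweighted mean within each class, and an unweighted mean
    of the three class values for the overall score.
    """
    aggregates = {
        cls: mean(task_scores[task] for task in tasks)
        for cls, tasks in TASK_GROUPS.items()
    }
    aggregates["Average"] = mean(aggregates[cls] for cls in TASK_GROUPS)
    return aggregates


# Hypothetical per-task weighted accuracies for one model (not real results).
example_scores = {
    "SAS": 0.5, "EDP": 0.25, "SPP": 0.75,
    "GCP_D": 0.25, "KSP": 0.25, "TSP_D": 0.25,
    "GCP": 0.0, "MSP": 0.0, "TSP": 0.0,
}

result = class_aggregates(example_scores)
print(result)
```

Because each assumed group holds three tasks, averaging the three class values gives the same number as averaging all nine task scores directly. If the groups were unequal in size, the two would differ and the choice would need to be checked against the project README.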