COMPL-AI | BenchmarkList

Metadata

ID: compl_ai
Category: Safety
Release: 2024-10-10
Source: Source page
Snapshot: Snapshot source
Post: Announcement post

Metrics

COMPL-AI average, Task coverage, Prejudiced Answers: BBQ, Biased Completions: BOLD, Toxic Completions of Benign Text: RealToxicityPrompts, Following Harmful Instructions: AdvBench, Monotonicity Checks, Self-Check Consistency, BoolQ Contrast Set, IMDB Contrast Set, Logit Calibration: BIG-Bench, Self-Assessment: TriviaQA, Income Fairness: DecodingTrust, Common Sense Reasoning: HellaSwag, Coding: HumanEval, Goal Hijacking and Prompt Leakage, Rule Following, Representation Bias: RedditBias, Truthfulness: TruthfulQA MC2, General Knowledge: MMLU, Reasoning: AI2 Reasoning Challenge, Denying Human Presence, Copyrighted Material Memorization, PII Extraction by Association, Recommendation Consistency: FaiRLLM, MMLU: Robustness, Watermark Reliability & Robustness, Bias of the Dataset, Toxicity of the Dataset

Latest Results

Rank	Subject	COMPL-AI average	Model Match	Provenance	Sampled
1	gpt-4-1106-preview	0.86	—	Imported	2026-05-06
2	Claude3Opus	0.85	—	Imported	2026-05-06
3	gemini-1.5-flash-001	0.80	—	Imported	2026-05-06
4	gpt-3.5-turbo-0125	0.77	—	Imported	2026-05-06
5	01-ai/Yi-34B-Chat	0.72	—	Imported	2026-05-06
6	Qwen/Qwen1.5-72B-Chat	0.72	—	Imported	2026-05-06
7	speakleash/Bielik-11B-v2.3-Instruct	0.71	—	Imported	2026-05-06
8	meta-llama/Llama-2-70b-chat-hf	0.70	—	Imported	2026-05-06
9	mistralai/Mixtral-8x7B-Instruct-v0.1	0.70	Mistral: Mixtral 8x7B Instruct (mistralai-mixtral-8x7b-instruct)	Imported	2026-05-06
10	mistralai/Mistral-7B-Instruct-v0.3	0.68	—	Imported	2026-05-06
11	mistralai/Mistral-7B-Instruct-v0.2	0.67	—	Imported	2026-05-06
12	meta-llama/Llama-2-13b-chat-hf	0.66	—	Imported	2026-05-06
13	mistralai/Mistral-7B-v0.3	0.66	—	Imported	2026-05-06
14	meta-llama/Llama-2-7b-chat-hf	0.63	—	Imported	2026-05-06
15	google/gemma-2-9b	0.58	—	Imported	2026-05-06
