BOLD

BOLD: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.

Rows: 15
Primary metric: bold_score
Sampled: 2026-05-27

Metadata

Metrics

BOLD score, Task coverage

Latest Results

Rows are imported from the per-model JSON files published in the public COMPL-AI Hugging Face Space, for the COMPL-AI task "Biased Completions: BOLD".

| Rank | Subject | BOLD score | Model Match | Provenance | Sampled |
|------|---------|------------|-------------|------------|---------|
| 1 | Claude3Opus | 0.757401 | | Imported | 2026-05-27 |
| 2 | mistralai/Mistral-7B-v0.3 | 0.742951 | | Imported | 2026-05-27 |
| 3 | gemini-1.5-flash-001 | 0.740392 | | Imported | 2026-05-27 |
| 4 | gpt-4-1106-preview | 0.7386 | | Imported | 2026-05-27 |
| 5 | google/gemma-2-9b | 0.737053 | | Imported | 2026-05-27 |
| 6 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 0.734902 | Mistral: Mixtral 8x7B Instruct (mistralai-mixtral-8x7b-instruct) | Imported | 2026-05-27 |
| 7 | gpt-3.5-turbo-0125 | 0.732026 | | Imported | 2026-05-27 |
| 8 | speakleash/Bielik-11B-v2.3-Instruct | 0.72906 | | Imported | 2026-05-27 |
| 9 | meta-llama/Llama-2-70b-chat-hf | 0.725245 | | Imported | 2026-05-27 |
| 10 | Qwen/Qwen1.5-72B-Chat | 0.720061 | | Imported | 2026-05-27 |
| 11 | meta-llama/Llama-2-13b-chat-hf | 0.719008 | | Imported | 2026-05-27 |
| 12 | mistralai/Mistral-7B-Instruct-v0.2 | 0.716837 | | Imported | 2026-05-27 |
| 13 | mistralai/Mistral-7B-Instruct-v0.3 | 0.710874 | | Imported | 2026-05-27 |
| 14 | 01-ai/Yi-34B-Chat | 0.683472 | | Imported | 2026-05-27 |
| 15 | meta-llama/Llama-2-7b-chat-hf | 0.679847 | | Imported | 2026-05-27 |
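
The ranking above orders subjects by descending BOLD score. As a rough illustration, the Python sketch below reproduces that ordering for four of the listed rows. The record layout (dicts with `subject` and `bold_score` keys) and the helper name `rank_by_score` are assumptions made for this example. They are not the actual schema of the COMPL-AI JSON files.

```python
# Minimal sketch: rank leaderboard rows by descending score.
# The dict layout below is hypothetical, chosen only for illustration.
records = [
    {"subject": "gpt-4-1106-preview", "bold_score": 0.7386},
    {"subject": "meta-llama/Llama-2-7b-chat-hf", "bold_score": 0.679847},
    {"subject": "Claude3Opus", "bold_score": 0.757401},
    {"subject": "mistralai/Mistral-7B-v0.3", "bold_score": 0.742951},
]


def rank_by_score(rows, metric="bold_score"):
    """Return (rank, subject, score) tuples, highest score first."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [
        (rank, row["subject"], row[metric])
        for rank, row in enumerate(ordered, start=1)
    ]


ranked = rank_by_score(records)
for rank, subject, score in ranked:
    print(f"{rank:>2}  {subject:<32} {score:.6f}")

# Gap between the best and worst score in this subset.
spread = ranked[0][2] - ranked[-1][2]
```

Ranks here are relative to the four sample rows, so they differ from the full table wherever intermediate rows are omitted.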