CyberBench

CyberBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.

13 rows
Primary metric: average_score
Sampled: 2026-05-28

Metadata

Metrics

Average Score, CyNER (F1), APTNER (F1), CyNews (R-1/2/L), SecMMLU (Accuracy), CyQuiz (Accuracy), MITRE (Accuracy), CVE (Accuracy), Web (F1), Email (F1), HTTP (F1)

Latest Results

Rows are imported from the CyberBench section of the public Frontier AI Cybersecurity Observatory results.json file.

| Rank | Subject | Average Score | Model Match | Provenance | Sampled |
|------|---------|---------------|-------------|------------|---------|
| 1 | GPT-4 | 69.6% | GPT-4 (openai-gpt-4) | Imported | 2026-05-28 |
| 2 | GPT-3.5-Turbo | 62.6% | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-28 |
| 3 | Mistral-7B-v0.1 | 58.1% | | Imported | 2026-05-28 |
| 4 | Zephyr-7B-beta | 57.7% | | Imported | 2026-05-28 |
| 5 | Vicuna-13B-v1.5 | 57.3% | | Imported | 2026-05-28 |
| 6 | Mistral-7B-Instruct-v0.1 | 55% | Mistral: Mistral 7B Instruct v0.1 (mistralai-mistral-7b-instruct-v0.1) | Imported | 2026-05-28 |
| 7 | Llama-2-13B | 54.1% | | Imported | 2026-05-28 |
| 8 | Vicuna-7B-v1.5 | 53% | | Imported | 2026-05-28 |
| 9 | Llama-2-7B | 50.6% | | Imported | 2026-05-28 |
| 10 | Llama-2-13B-Chat | 45% | | Imported | 2026-05-28 |
| 11 | Llama-2-7B-Chat | 44.6% | | Imported | 2026-05-28 |
| 12 | Falcon-7B | 39.4% | | Imported | 2026-05-28 |
| 13 | Falcon-7B-Instruct | 37.5% | | Imported | 2026-05-28 |
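
The import-and-rank step described above can be sketched roughly as follows. This is a minimal illustration, not the Observatory's actual importer: the results.json schema is not shown on this page, so the key names ("CyberBench", "subject", "average_score") and the helper rank_rows are assumptions. The four sample scores are copied from the table above.

```python
import json

# Hypothetical excerpt. The real results.json layout is not documented here,
# so the keys "CyberBench", "subject", and "average_score" are assumptions.
RAW = """
{
  "CyberBench": [
    {"subject": "Llama-2-7B", "average_score": 50.6},
    {"subject": "GPT-4", "average_score": 69.6},
    {"subject": "Falcon-7B", "average_score": 39.4},
    {"subject": "GPT-3.5-Turbo", "average_score": 62.6}
  ]
}
"""


def rank_rows(raw_json, section="CyberBench", metric="average_score"):
    """Return (rank, subject, score) tuples, highest primary metric first."""
    rows = json.loads(raw_json)[section]
    # Sort descending on the primary metric, then number rows from 1.
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [
        (rank, row["subject"], row[metric])
        for rank, row in enumerate(ordered, start=1)
    ]


ranked = rank_rows(RAW)
for rank, subject, score in ranked:
    print(f"{rank} {subject} {score}%")
```

Sorting on a single named metric keeps the ranking rule explicit, so switching to one of the per-task metrics would only require passing a different metric name, assuming the file exposes them under comparable keys.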