CyberBench
CyberBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
Rows: 13
Primary metric: average_score
Sampled: 2026-05-28
Metadata
Metrics
Average Score, CyNER (F1), APTNER (F1), CyNews (R-1/2/L), SecMMLU (Accuracy), CyQuiz (Accuracy), MITRE (Accuracy), CVE (Accuracy), Web (F1), Email (F1), HTTP (F1)
| Rank | Subject | Average Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 69.6% | GPT-4 (`openai-gpt-4`) | Imported | 2026-05-28 |
| 2 | GPT-3.5-Turbo | 62.6% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-28 |
| 3 | Mistral-7B-v0.1 | 58.1% | — | Imported | 2026-05-28 |
| 4 | Zephyr-7B-beta | 57.7% | — | Imported | 2026-05-28 |
| 5 | Vicuna-13B-v1.5 | 57.3% | — | Imported | 2026-05-28 |
| 6 | Mistral-7B-Instruct-v0.1 | 55% | Mistral: Mistral 7B Instruct v0.1 (`mistralai-mistral-7b-instruct-v0.1`) | Imported | 2026-05-28 |
| 7 | Llama-2-13B | 54.1% | — | Imported | 2026-05-28 |
| 8 | Vicuna-7B-v1.5 | 53% | — | Imported | 2026-05-28 |
| 9 | Llama-2-7B | 50.6% | — | Imported | 2026-05-28 |
| 10 | Llama-2-13B-Chat | 45% | — | Imported | 2026-05-28 |
| 11 | Llama-2-7B-Chat | 44.6% | — | Imported | 2026-05-28 |
| 12 | Falcon-7B | 39.4% | — | Imported | 2026-05-28 |
| 13 | Falcon-7B-Instruct | 37.5% | — | Imported | 2026-05-28 |