RealToxicityPrompts

RealToxicityPrompts: Measures how often a model's continuations of naturally occurring prompts are toxic. The primary metric is the toxic fraction, where lower is better.

Rows: 42
Primary metric: toxic_fraction
Sampled: 2026-05-27

Metadata

Metrics

Toxic fraction (lower is better), Stereotypes (race) (lower is better), Stereotypes (gender) (lower is better), Representation (race) (lower is better), Representation (gender) (lower is better), Denoised inference time (lower is better), # eval, # prompt tokens (lower is better), # output tokens (lower is better), # trials
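
As a rough illustration of the primary metric, toxic fraction can be read as the share of model completions flagged as toxic. The page does not say how completions are flagged, so the sketch below assumes each completion already has a toxicity score in [0, 1] and treats scores at or above 0.5 as toxic. The function name, the scores, and the cutoff are all hypothetical.

```python
def toxic_fraction(scores, threshold=0.5):
    """Return the share of completions whose score is >= threshold.

    `scores` holds one toxicity score per completion (assumed to be in [0, 1]).
    The 0.5 default cutoff is an assumption, not something stated on this page.
    """
    if not scores:
        raise ValueError("need at least one score")
    flagged = sum(1 for s in scores if s >= threshold)
    return flagged / len(scores)


# Hypothetical per-completion scores, for illustration only.
example_scores = [0.02, 0.91, 0.10, 0.47, 0.05, 0.63, 0.01, 0.08]
print(toxic_fraction(example_scores))
```

Two of the eight example scores reach the cutoff, so a lower result means fewer flagged completions.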

Latest Results

Rows are parsed from HELM Classic public GCS release artifacts for the RealToxicityPrompts group. Rows are ranked in ascending order of toxic fraction, so rank 1 has the lowest value.

| Rank | Subject | Toxic fraction | Model Match | Provenance | Sampled |
|------|---------|----------------|-------------|------------|---------|
| 1 | Palmyra X (43B) | 0.00558 | | Imported | 2026-05-27 |
| 2 | Cohere medium v20220720 (6.1B) | 0.008333 | | Imported | 2026-05-27 |
| 3 | T5 (11B) | 0.009234 | | Imported | 2026-05-27 |
| 4 | Cohere Command beta (52.4B) | 0.009874 | | Imported | 2026-05-27 |
| 5 | Cohere medium v20221108 (6.1B) | 0.009874 | | Imported | 2026-05-27 |
| 6 | Cohere xlarge v20220609 (52.4B) | 0.010061 | | Imported | 2026-05-27 |
| 7 | Cohere small v20220720 (410M) | 0.010065 | | Imported | 2026-05-27 |
| 8 | Cohere Command beta (6.1B) | 0.010127 | | Imported | 2026-05-27 |
| 9 | Cohere large v20220720 (13.1B) | 0.010513 | | Imported | 2026-05-27 |
| 10 | Cohere xlarge v20221108 (52.4B) | 0.012435 | | Imported | 2026-05-27 |
| 11 | Luminous Base (13B) | 0.025905 | | Imported | 2026-05-27 |
| 12 | Luminous Extended (30B) | 0.028088 | | Imported | 2026-05-27 |
| 13 | text-davinci-003 | 0.038652 | | Imported | 2026-05-27 |
| 14 | text-ada-001 | 0.03988 | | Imported | 2026-05-27 |
| 15 | UL2 (20B) | 0.043144 | | Imported | 2026-05-27 |
| 16 | InstructPalmyra (30B) | 0.048678 | | Imported | 2026-05-27 |
| 17 | GPT-NeoX (20B) | 0.049953 | | Imported | 2026-05-27 |
| 18 | J1-Large v1 (7.5B) | 0.051672 | | Imported | 2026-05-27 |
| 19 | GPT-J (6B) | 0.052125 | | Imported | 2026-05-27 |
| 20 | OPT (175B) | 0.052457 | | Imported | 2026-05-27 |
| 21 | babbage (1.3B) | 0.053345 | | Imported | 2026-05-27 |
| 22 | ada (350M) | 0.053856 | | Imported | 2026-05-27 |
| 23 | J1-Grande v1 (17B) | 0.055837 | | Imported | 2026-05-27 |
| 24 | Luminous Supreme (70B) | 0.056243 | | Imported | 2026-05-27 |
| 25 | J1-Jumbo v1 (178B) | 0.056293 | | Imported | 2026-05-27 |
| 26 | curie (6.7B) | 0.056297 | | Imported | 2026-05-27 |
| 27 | BLOOM (176B) | 0.057057 | | Imported | 2026-05-27 |
| 28 | OPT (66B) | 0.058029 | | Imported | 2026-05-27 |
| 29 | TNLG v2 (530B) | 0.058353 | | Imported | 2026-05-27 |
| 30 | GLM (130B) | 0.058473 | | Imported | 2026-05-27 |
| 31 | TNLG v2 (6.7B) | 0.058664 | | Imported | 2026-05-27 |
| 32 | Anthropic-LM v4-s3 (52B) | 0.058863 | | Imported | 2026-05-27 |
| 33 | Jurassic-2 Large (7.5B) | 0.059623 | | Imported | 2026-05-27 |
| 34 | Jurassic-2 Jumbo (178B) | 0.061874 | | Imported | 2026-05-27 |
| 35 | text-babbage-001 | 0.061895 | | Imported | 2026-05-27 |
| 36 | text-davinci-002 | 0.063028 | | Imported | 2026-05-27 |
| 37 | T0pp (11B) | 0.063826 | | Imported | 2026-05-27 |
| 38 | J1-Grande v2 beta (17B) | 0.064178 | | Imported | 2026-05-27 |
| 39 | text-curie-001 | 0.064431 | | Imported | 2026-05-27 |
| 40 | Jurassic-2 Grande (17B) | 0.067321 | | Imported | 2026-05-27 |
| 41 | davinci (175B) | 0.069712 | | Imported | 2026-05-27 |
| 42 | YaLM (100B) | 0.092246 | | Imported | 2026-05-27 |
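
The ranking rule used for these results (lowest toxic fraction first) can be sketched in a few lines. The example below is a minimal illustration rather than the site's actual pipeline. It uses five rows taken from the results, and the function name `rank_by_toxic_fraction` is hypothetical. The page does not state how ties are broken, so the sketch assumes tied rows keep their input order, which Python's stable sort provides.

```python
# A few (subject, toxic_fraction) pairs taken from the results, deliberately shuffled.
rows = [
    ("T5 (11B)", 0.009234),
    ("Palmyra X (43B)", 0.00558),
    ("Cohere Command beta (52.4B)", 0.009874),
    ("Cohere medium v20221108 (6.1B)", 0.009874),
    ("Cohere medium v20220720 (6.1B)", 0.008333),
]


def rank_by_toxic_fraction(rows):
    """Return (rank, subject, toxic_fraction) tuples, lowest toxic fraction first.

    sorted() is stable, so rows with equal values keep their input order.
    That tie-breaking behaviour is an assumption, not taken from the page.
    """
    ordered = sorted(rows, key=lambda row: row[1])
    return [(rank, name, value) for rank, (name, value) in enumerate(ordered, start=1)]


ranked = rank_by_toxic_fraction(rows)
for rank, name, value in ranked:
    print(rank, name, value)
```

With this input, the two rows that share the value 0.009874 receive consecutive ranks in the order they were supplied.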