StereoSet

StereoSet: Measures stereotypical bias in pretrained language models across gender, profession, race, and religion, alongside their language modeling ability.

11 rows
Primary metric: icat_score
Sampled: 2026-05-27

Metadata

Metrics

ICAT Score, LM Score, SS Score (lower is better), Example count, intrasentence ICAT Score, intrasentence LM Score, intrasentence SS Score (lower is better), intersentence ICAT Score, intersentence LM Score, intersentence SS Score (lower is better)
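The ICAT Score combines the LM Score and the SS Score into one number. The following is a minimal sketch, assuming the definition given in the StereoSet paper, ICAT = LM * min(SS, 100 - SS) / 50, with both scores on a 0-100 scale. The function name `icat` and the sample inputs are illustrative and are not taken from the source repository or from the results on this page. Under this definition an SS Score of 50 is the ideal, so "lower is better" applies to values above 50.

```python
def icat(lm_score: float, ss_score: float) -> float:
    """Idealized CAT score from an LM Score and an SS Score (both 0-100).

    Assumes the StereoSet paper's definition; this is a sketch, not the
    repository's evaluation code.
    """
    if not (0.0 <= lm_score <= 100.0 and 0.0 <= ss_score <= 100.0):
        raise ValueError("scores must lie in [0, 100]")
    # min(ss, 100 - ss) / 50 is 1 at ss = 50 and falls to 0 at ss = 0 or 100,
    # so the LM Score is scaled down as the model drifts from neutrality.
    return lm_score * min(ss_score, 100.0 - ss_score) / 50.0


# Hypothetical inputs chosen for illustration only.
print(icat(100.0, 50.0))  # perfect language model, no stereotype preference
print(icat(90.0, 60.0))   # prefers stereotypes 60% of the time
print(icat(90.0, 40.0))   # prefers anti-stereotypes 60% of the time
```

The last two calls return the same value because the penalty is symmetric around an SS Score of 50.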

Latest Results

Rows are parsed from the prediction summary text checked into the public StereoSet repository. That repository notes that it is no longer actively maintained and points users to Bias Bench for updated code and test data.

Rank  Subject            ICAT Score   Model Match  Provenance  Sampled
1     gpt2               71.88928%    -            Imported    2026-05-27
2     gpt2-medium        71.49694%    -            Imported    2026-05-27
3     xlnet-large-cased  71.240645%   -            Imported    2026-05-27
4     bert-base-cased    69.379538%   -            Imported    2026-05-27
5     bert-large-cased   69.21552%    -            Imported    2026-05-27
6     EnsembleModel      68.958214%   -            Imported    2026-05-27
7     roberta-base       68.832156%   -            Imported    2026-05-27
8     gpt2-large         67.825437%   -            Imported    2026-05-27
9     roberta-large      67.367678%   -            Imported    2026-05-27
10    xlnet-base-cased   61.628847%   -            Imported    2026-05-27
11    SentimentModel     49.644261%   -            Imported    2026-05-27