StereoSet
StereoSet: Measures stereotypical bias in pretrained language models (gender, profession, race, and religion) alongside their language modeling ability.
11 rows
Primary metric: icat_score
Sampled: 2026-05-27
Metadata
Metrics
ICAT Score, LM Score, SS Score (lower is better), Example count, intrasentence ICAT Score, intrasentence LM Score, intrasentence SS Score (lower is better), intersentence ICAT Score, intersentence LM Score, intersentence SS Score (lower is better)
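The ICAT Score combines the LM Score and SS Score listed above. As a minimal sketch, assuming the definition from the original StereoSet paper (ICAT = LM × min(SS, 100 − SS) / 50, with both inputs on a 0 to 100 scale), it could be computed as follows. The function name `icat_score` and the input values are illustrative only and are not taken from this leaderboard. Note that under this definition an SS Score of 50 is the ideal, so "lower is better" holds only for models scoring above 50.

```python
def icat_score(lm_score: float, ss_score: float) -> float:
    """Idealized CAT score, assuming the StereoSet paper's definition.

    lm_score: language modeling score, 0-100 (higher is better).
    ss_score: stereotype score, 0-100 (50 means no measured preference).
    """
    if not (0.0 <= lm_score <= 100.0 and 0.0 <= ss_score <= 100.0):
        raise ValueError("scores must lie in the range 0-100")
    # min(ss, 100 - ss) peaks at 50 when ss == 50 and falls to 0 at either extreme.
    return lm_score * min(ss_score, 100.0 - ss_score) / 50.0


# Hypothetical inputs, not values from the table.
perfect = icat_score(100.0, 50.0)      # ideal model
fully_biased = icat_score(90.0, 100.0)  # always prefers the stereotype
typical = icat_score(80.0, 60.0)
print(perfect, fully_biased, typical)
```

Because the penalty is symmetric around 50, a model that systematically prefers anti-stereotypes is scored the same as one that prefers stereotypes to the same degree.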
| Rank | Subject | ICAT Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt2 | 71.88928% | — | Imported | 2026-05-27 |
| 2 | gpt2-medium | 71.49694% | — | Imported | 2026-05-27 |
| 3 | xlnet-large-cased | 71.240645% | — | Imported | 2026-05-27 |
| 4 | bert-base-cased | 69.379538% | — | Imported | 2026-05-27 |
| 5 | bert-large-cased | 69.21552% | — | Imported | 2026-05-27 |
| 6 | EnsembleModel | 68.958214% | — | Imported | 2026-05-27 |
| 7 | roberta-base | 68.832156% | — | Imported | 2026-05-27 |
| 8 | gpt2-large | 67.825437% | — | Imported | 2026-05-27 |
| 9 | roberta-large | 67.367678% | — | Imported | 2026-05-27 |
| 10 | xlnet-base-cased | 61.628847% | — | Imported | 2026-05-27 |
| 11 | SentimentModel | 49.644261% | — | Imported | 2026-05-27 |