HH-RLHF

HH-RLHF: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or other alignment-relevant behavior.

0 rows
Primary metric: score
Sampled

Metadata

Metrics

Score

Latest Results

Rank Subject Score Model Match Provenance Sampled