HANS

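HANS (Heuristic Analysis for NLI Systems): Evaluates whether natural language inference models rely on shallow syntactic heuristics (lexical overlap, subsequence, constituent) instead of the actual relationship between premise and hypothesis.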

Rows: 4
Primary metric: overall_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Overall accuracy, lexical_overlap accuracy, subsequence accuracy, constituent accuracy

Latest Results

Rows are computed from the prediction files in the public HANS repository and the labels of the checked-in heuristic evaluation set. Model identities are taken from the source filenames only; no further remapping is inferred.
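As a rough illustration of how the metrics listed above could be derived from predictions and gold labels, here is a minimal Python sketch. It is not the code used to produce this leaderboard. The function name `hans_accuracies`, the in-memory `(heuristic, gold_label, predicted_label)` tuple layout, and the toy data are all assumptions made for the example; the actual file formats in the HANS repository are not shown here.

```python
from collections import defaultdict


def hans_accuracies(examples):
    """Return accuracy in percent, overall and per heuristic.

    `examples` is an iterable of (heuristic, gold_label, predicted_label)
    tuples. This layout is a hypothetical in-memory form, not the
    repository's on-disk format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for heuristic, gold, predicted in examples:
        # Every example counts toward "overall" and toward its own heuristic.
        for key in ("overall", heuristic):
            total[key] += 1
            correct[key] += int(gold == predicted)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy data for illustration only; these are not real HANS examples.
toy = [
    ("lexical_overlap", "entailment", "entailment"),
    ("lexical_overlap", "non-entailment", "entailment"),
    ("subsequence", "entailment", "entailment"),
    ("constituent", "non-entailment", "non-entailment"),
]
scores = hans_accuracies(toy)
print(scores)
```

On the toy data, three of the four predictions match the gold label, so the overall figure is 75%, while the lexical_overlap figure is 50% because one of its two examples is wrong.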

Rank | Subject     | Overall accuracy | Model Match | Provenance | Sampled
1    | esim        | 49.416667%       |             | Imported   | 2026-05-27
2    | decomp attn | 49.19%           |             | Imported   | 2026-05-27
3    | bert        | 48.733333%       |             | Imported   | 2026-05-27
4    | spinn       | 47.42%           |             | Imported   | 2026-05-27