Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from feature biases that enable cheating is difficult. Chief among these biases is single-sentence label leakage, in which annotator-introduced spurious correlations yield datasets where the logical relation between a (premise, hypothesis) pair can be accurately predicted from only one of the two sentences, something that should in principle be impossible. We demonstrate that, despite efforts to reduce it, this leakage persists in datasets introduced since its discovery in 2018. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of the subpopulations in the data that maximally exhibit it.