Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from shortcut biases is difficult. Chief among these biases is "single sentence label leakage," where annotator-introduced spurious correlations yield datasets in which the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that, despite efforts to reduce this leakage, it persists in modern datasets introduced since its discovery in 2018. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of subpopulations in the data that maximally exhibit it.
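To make single sentence label leakage concrete, the sketch below trains a classifier that sees only the hypothesis and compares its accuracy with a majority-class baseline. This is a generic hypothesis-only probe, not the PECO technique itself. The function names, the naive Bayes model, and the toy sentences are all assumptions made for illustration; none of them come from the work summarized above.

```python
import math
from collections import Counter, defaultdict


def train_hypothesis_only(examples):
    """Fit a bag-of-words naive Bayes model on (hypothesis, label) pairs.

    Premises are never shown to the model, so any accuracy above the
    majority baseline must come from the hypothesis sentence alone.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for hypothesis, label in examples:
        label_counts[label] += 1
        for word in hypothesis.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, hypothesis):
    """Return the most probable label, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in hypothesis.lower().split():
            if word in vocab:  # words unseen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def hypothesis_only_accuracy(train, test):
    """Return (hypothesis-only accuracy, majority-class accuracy) on test."""
    model = train_hypothesis_only(train)
    accuracy = sum(predict(model, h) == y for h, y in test) / len(test)
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    baseline = sum(y == majority for _, y in test) / len(test)
    return accuracy, baseline


# Invented toy data, for illustration only; not taken from any real benchmark.
# The word "nobody" is deliberately correlated with the contradiction label.
TRAIN = [
    ("nobody is here", "contradiction"),
    ("nobody is there", "contradiction"),
    ("someone is here", "entailment"),
    ("someone is there", "entailment"),
]
TEST = [
    ("nobody is outside", "contradiction"),
    ("someone is outside", "entailment"),
]

accuracy, baseline = hypothesis_only_accuracy(TRAIN, TEST)
print(f"hypothesis-only accuracy={accuracy:.2f}, majority baseline={baseline:.2f}")
```

A gap between the two numbers suggests that labels can be partly recovered without the premise. On a real benchmark the same comparison would be run on held-out data with a stronger model.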