In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.