Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
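The statistical test alluded to above can be made concrete. Below is a minimal sketch (not the authors' released code) of a per-token artifact test, assuming a binary task with balanced labels: under the competency assumption p(y | x_i) = p(y), any single token should co-occur with each label half the time, so a two-sided binomial test, with a Bonferroni correction over the vocabulary, flags tokens whose empirical conditional label distribution deviates from 0.5 by more than chance allows. The names `dataset`, `alpha`, and `find_artifact_tokens` are illustrative placeholders.

```python
# Minimal sketch of a per-token artifact test for a balanced binary task.
# Under the competency assumption, p(y = 1 | token present) should be 0.5.
from collections import defaultdict
from scipy.stats import binomtest

def find_artifact_tokens(dataset, alpha=0.01):
    """dataset: iterable of (tokens, label) pairs, with label in {0, 1}.

    Returns (token, p(y=1 | token), p-value) triples that reject the
    competency null hypothesis p(y=1 | token) = 0.5.
    """
    positives = defaultdict(int)  # label-1 instances containing the token
    totals = defaultdict(int)     # all instances containing the token
    for tokens, label in dataset:
        for token in set(tokens):  # test presence, not frequency
            totals[token] += 1
            positives[token] += label

    # Bonferroni correction: one hypothesis test per distinct token.
    threshold = alpha / max(len(totals), 1)
    flagged = []
    for token, n in totals.items():
        result = binomtest(positives[token], n, p=0.5)
        if result.pvalue < threshold:
            flagged.append((token, positives[token] / n, result.pvalue))
    return sorted(flagged, key=lambda item: item[2])
```

The power of this test grows with the number of occurrences of each token, which is one way to see the abstract's scaling claim: a small per-token annotator bias that is statistically invisible at small n becomes a significant, exploitable artifact as the dataset grows, so realistic datasets deviate further from competency problems at larger sizes.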