Geometry matters: Exploring language examples at the decision boundary

A growing body of recent evidence has highlighted the limitations of natural language processing (NLP) datasets and classifiers. These include annotation artifacts in datasets and classifiers that rely on shallow features such as a single word (e.g., if a movie review contains the word "romantic", the review tends to be classified as positive) or on irrelevant words (e.g., learning a proper noun to classify a movie review as positive or negative). The presence of such artifacts has led to the development of challenging datasets intended to force models to generalize better. While a variety of heuristic strategies, such as counterfactual examples and contrast sets, have been proposed, the theoretical justification for what makes these examples difficult is often lacking or unclear. In this paper, using tools from information geometry, we propose a theoretical way to quantify the difficulty of an example in NLP. Using our approach, we explore difficult examples for two popular NLP architectures. We discover that both BERT and CNN are susceptible to single-word substitutions in high-difficulty examples. Conversely, examples with low difficulty scores tend to be robust to multiple word substitutions. Our analysis shows that perturbations like contrast sets and counterfactual examples are not necessarily difficult for the model, and they may not be accomplishing the intended goal. Our approach is simple, architecture-agnostic, and easily extendable to other datasets. All the code used will be made publicly available, including a tool to explore the difficult examples for other datasets.
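The abstract does not spell out how difficulty is computed. As a rough illustration only, one standard information-geometric quantity is the Fisher information of the classifier's output distribution with respect to its input, whose largest eigenvalue grows as an input approaches the decision boundary. The sketch below assumes a toy linear softmax classifier; the function name `fisher_difficulty` and the weights `W`, `b` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def fisher_difficulty(W, b, x):
    """Largest eigenvalue of the Fisher information of p(y | x) w.r.t. x.

    Assumption: logits are linear in the input, z = W @ x + b, so the
    Fisher information has the closed form W^T (diag(p) - p p^T) W.
    """
    p = softmax(W @ x + b)
    # Covariance of the one-hot label under the predicted distribution.
    cov = np.diag(p) - np.outer(p, p)
    fim = W.T @ cov @ W
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(fim)[-1])


# Hypothetical two-class classifier whose boundary is the line x[0] == 0.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)

near = fisher_difficulty(W, b, np.array([0.0, 0.3]))  # on the boundary
far = fisher_difficulty(W, b, np.array([3.0, 0.3]))   # far from the boundary
print(near, far)
```

Under this assumption, a point on the boundary scores higher than a point far from it, which matches the intuition that high-difficulty examples are the ones a small perturbation, such as a single-word substitution, can flip.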
