Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future

Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore highly desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or evaluate them on the same datasets. This raises strong concerns about the methods' general performance and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup that includes a new formalization of the annotation error detection task, an evaluation protocol, and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use, open-source software package.

