Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore very desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns about the methods' general performance and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup, including a new formalization of the annotation error detection task, an evaluation protocol, and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use and open-source software package.