Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets

The use of machine learning (ML)-based language models (LMs) to monitor online content is on the rise. For toxic text identification, task-specific fine-tuning of these models is performed using datasets in which annotators provide ground-truth labels intended to distinguish offensive from normal content. These projects have led to the development, improvement, and expansion of large datasets over time, and have contributed immensely to research on natural language. Despite these achievements, existing evidence suggests that ML models built on these datasets do not always produce desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of their inherent issues and contributing to discussions on navigating these challenges in existing and future projects. To this end, we re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help improve dataset quality. While this approach may not improve the traditional metric of inter-annotator agreement, it may better capture dependence on context and diversity among annotators. We discuss the implications of these results for both theory and practice.
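To make the contrast between single-label agreement and multi-label annotation concrete, here is a minimal sketch. It assumes two annotators and uses Cohen's kappa as the agreement measure; the abstract does not say which agreement statistic the study uses, and the label names (`toxic`, `insult`, `profanity`) and sample annotations are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def per_label_kappa(a, b, labels):
    """Multi-label case: each annotation is a set of labels.

    Agreement is computed separately for each label on its
    binary present/absent decision.
    """
    return {
        label: cohen_kappa([label in s for s in a], [label in s for s in b])
        for label in labels
    }


# Hypothetical single-label annotations (one label per sample).
single_a = ["toxic", "normal", "toxic", "normal"]
single_b = ["toxic", "normal", "normal", "normal"]

# Hypothetical multi-label annotations (a set of labels per sample).
multi_a = [{"insult"}, set(), {"insult", "profanity"}, set()]
multi_b = [{"insult"}, set(), {"profanity"}, set()]

print(cohen_kappa(single_a, single_b))
print(per_label_kappa(multi_a, multi_b, ["insult", "profanity"]))
```

In this toy data the annotators disagree on the third sample under the single-label scheme, while the per-label view shows that they agree on one aspect of it and differ on another, which is the kind of detail a single aggregate agreement score can hide.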

