On social media platforms, hateful and offensive language negatively impacts the mental well-being of users and the participation of people from diverse backgrounds. Automatic methods to detect offensive language have largely relied on datasets with categorical labels. However, comments can vary in their degree of offensiveness. We create the first dataset of English-language Reddit comments that has \textit{fine-grained, real-valued scores} between -1 (maximally supportive) and 1 (maximally offensive). The dataset was annotated using \emph{Best--Worst Scaling}, a form of comparative annotation that has been shown to alleviate known biases of using rating scales. We show that the method produces highly reliable offensiveness scores. Finally, we evaluate the ability of widely used neural models to predict offensiveness scores on this new dataset.
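In Best--Worst Scaling, annotators are shown small sets of items (typically 4-tuples) and pick the best and worst item in each set; the standard counting procedure then scores each item as the fraction of times it was chosen best minus the fraction of times it was chosen worst, yielding a real value in [-1, 1]. A minimal sketch of that aggregation step (function and item names are illustrative, not from the paper):

```python
from collections import defaultdict

def bws_scores(annotations):
    """Aggregate Best-Worst Scaling annotations into real-valued scores.

    Each annotation is a (items, best, worst) triple, where `items` is the
    tuple of items shown to the annotator. Each item's score is
    (#times chosen best - #times chosen worst) / #times shown,
    which lies in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            shown[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Hypothetical toy annotations over comments c1..c4
annotations = [
    (("c1", "c2", "c3", "c4"), "c1", "c4"),
    (("c1", "c2", "c3", "c4"), "c1", "c3"),
]
scores = bws_scores(annotations)
# c1 was chosen best in both tuples, so scores["c1"] == 1.0
```

Because every item's score is a difference of two frequencies normalized by exposure, items never shown as best or worst naturally receive a score of 0.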