On social media platforms, hateful and offensive language negatively impacts the mental well-being of users and the participation of people from diverse backgrounds. Automatic methods to detect offensive language have largely relied on datasets with categorical labels. However, comments can vary in their degree of offensiveness. We create the first dataset of English-language Reddit comments that has fine-grained, real-valued scores between -1 (maximally supportive) and 1 (maximally offensive). The dataset was annotated using Best–Worst Scaling, a form of comparative annotation that has been shown to alleviate known biases of using rating scales. We show that the method produces highly reliable offensiveness scores. Finally, we evaluate the ability of widely used neural models to predict offensiveness scores on this new dataset.
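To make the annotation scheme concrete, here is a minimal sketch of the standard Best–Worst Scaling counting procedure, which maps comparative judgments to real-valued scores in [-1, 1]. This is an illustration of the general BWS technique, not necessarily the authors' exact scoring pipeline; the tuple format and item names are hypothetical.

```python
from collections import defaultdict

def bws_scores(annotations):
    """Convert Best-Worst Scaling annotations to real-valued scores.

    Each annotation is an (items, best, worst) tuple: the small set of
    items shown to an annotator, plus the one judged most offensive
    ("best", i.e. best fits the property) and the one judged least
    offensive ("worst"). The standard counting procedure scores each
    item as (#times chosen best - #times chosen worst) / #times shown,
    yielding a value in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            shown[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Toy example: three annotators each see the same 4-tuple of comments
# (item names c1..c4 are hypothetical placeholders).
anns = [
    (("c1", "c2", "c3", "c4"), "c1", "c4"),
    (("c1", "c2", "c3", "c4"), "c1", "c3"),
    (("c1", "c2", "c3", "c4"), "c2", "c4"),
]
scores = bws_scores(anns)
```

In this toy run, c1 (chosen most offensive twice, never least) scores 2/3, while c4 (chosen least offensive twice) scores -2/3, matching the intended [-1, 1] interpretation.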