Online harassment in the form of hate speech has been on the rise in recent years. Addressing the issue requires a combination of human content moderation and automatic detection methods. Because content moderation is itself harmful to the people performing it, we aim to reduce their burden by improving the automatic detection of hate speech. Hate speech presents a challenge because it is directed at different target groups using entirely different vocabularies. Furthermore, the authors of hate speech are incentivized to disguise their behavior to avoid removal from a platform. This makes it difficult to develop a comprehensive data set for training and evaluating hate speech detection models, because the examples that represent one hate speech domain do not typically represent others, even within the same language or culture. We propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs, and BERT) on three different collections. We show that our approach improves area under the precision/recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision.
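As a point of reference for the headline numbers, the evaluation metric reported above, area under the precision/recall curve, can be summarized by average precision: the mean of the precision values at each rank where a true positive is retrieved. The sketch below is a minimal stdlib-only illustration on toy labels and scores; it is not the paper's code, and the data are invented for demonstration.

```python
def average_precision(y_true, y_score):
    """Average precision: summarizes the precision/recall curve as the
    mean of precision at each rank where a positive item is retrieved.
    y_true: binary labels (1 = hate speech, 0 = benign); y_score: model
    confidence scores. Toy implementation assuming distinct scores."""
    ranked = sorted(zip(y_score, y_true), reverse=True)  # rank by score, descending
    positives = sum(y_true)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall point
    return ap / positives

# Toy example: 4 hate-speech items among 8, with one mis-ranked negative.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.75, 0.1, 0.8, 0.3]
print(f"AP: {average_precision(labels, scores):.4f}")  # → AP: 0.8875
```

A perfect ranking (every positive scored above every negative) yields an AP of 1.0; the 42% relative improvement reported in the abstract refers to gains in this area-under-curve measure, not absolute precision.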