A hate speech detection model should be judged on two critical aspects in addition to detection performance: bias and explainability. Hate speech cannot be identified solely from the presence of specific words, so the model should be able to reason as humans do and be explainable. To improve performance on these two aspects, we propose Masked Rationale Prediction (MRP) as an intermediate task. In MRP, the model predicts masked human rationales (the snippets of a sentence that are the grounds for human judgment) by referring to the surrounding tokens combined with their unmasked rationales. Because the model learns its reasoning ability from rationales through MRP, it performs hate speech detection robustly in terms of bias and explainability. The proposed method generally achieves state-of-the-art performance across various metrics, demonstrating its effectiveness for hate speech detection.
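The masking step behind MRP can be illustrated with a minimal sketch. It assumes rationales are given as one binary label per token (1 if annotators marked the token as a ground for their judgment, 0 otherwise). The function name `mask_rationales`, the `MASK` and `IGNORE` constants, the `mask_ratio` parameter, and the placeholder tokens are all hypothetical choices for illustration and are not taken from the paper; in particular, how the paper encodes a hidden rationale and how many labels it masks are not specified here.

```python
import random

MASK = 2        # hypothetical id standing in for a hidden rationale label
IGNORE = -100   # hypothetical "not a target" marker (same value as the default
                # ignore_index of PyTorch's CrossEntropyLoss)


def mask_rationales(rationales, mask_ratio=0.5, seed=0):
    """Hide a random subset of per-token rationale labels.

    Returns (inputs, targets):
      inputs[i]  is MASK where the label was hidden, else the original label;
      targets[i] is the original label where hidden, else IGNORE.
    """
    rng = random.Random(seed)
    n = len(rationales)
    # Mask at least one position for non-empty input (an assumption of this sketch).
    k = min(n, max(1, round(n * mask_ratio)))
    masked_positions = set(rng.sample(range(n), k))

    inputs, targets = [], []
    for i, label in enumerate(rationales):
        if i in masked_positions:
            inputs.append(MASK)      # the model must predict this label
            targets.append(label)
        else:
            inputs.append(label)     # visible rationale the model may refer to
            targets.append(IGNORE)
    return inputs, targets


# Placeholder tokens and made-up rationale labels, for illustration only.
tokens = ["tok0", "tok1", "tok2", "tok3", "tok4", "tok5"]
rationales = [0, 1, 0, 1, 0, 0]
inputs, targets = mask_rationales(rationales, mask_ratio=0.5, seed=0)

for tok, inp, tgt in zip(tokens, inputs, targets):
    print(tok, inp, tgt)
```

A model trained on this intermediate task would receive the tokens together with `inputs` and be scored only at positions where `targets` is not `IGNORE`, which is one way to realize "predicting masked rationales from surrounding tokens and their unmasked rationales".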