Hate speech takes many forms, targeting communities with derogatory comments and setting back societal progress. HateXplain is the first dataset to use annotated spans in the form of rationales, along with speech-classification categories and targeted communities, to make classification more human-like, explainable, accurate, and less biased. We fine-tune BERT to perform this task in the form of rationale and class prediction, and compare our performance on metrics spanning accuracy, explainability, and bias. Our novelty is threefold. First, we experiment with an amalgamated rationale-class loss using different importance weights. Second, we experiment extensively with the ground-truth attention values for the rationales; by introducing conservative and lenient attentions, we compare model performance on HateXplain and test our hypothesis. Third, to reduce unintended bias in our models, we mask target-community words and observe the improvement in bias and explainability metrics. Overall, we achieve model explainability, bias reduction, and several incremental improvements over the original BERT implementation.
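To make the amalgamated loss concrete, the sketch below shows one way to weight a classification term against a rationale-attention term. This is a minimal illustration assuming a PyTorch-style setup with an MSE attention-supervision term; the names `lam`, `attn_pred`, and `attn_gt` are hypothetical, and the paper's actual loss formulation may differ.

```python
import torch.nn.functional as F

def amalgamated_loss(class_logits, class_labels, attn_pred, attn_gt, lam=0.5):
    # Classification loss over the HateXplain classes
    # (hateful / offensive / normal).
    cls_loss = F.cross_entropy(class_logits, class_labels)
    # Rationale loss: push the model's token-level attention toward the
    # ground-truth attention derived from annotator rationale spans
    # (MSE here is an assumption, not the paper's stated choice).
    attn_loss = F.mse_loss(attn_pred, attn_gt)
    # The importance weight `lam` trades off class prediction
    # against rationale fidelity.
    return lam * cls_loss + (1.0 - lam) * attn_loss
```

Sweeping `lam` corresponds to the "different importance values" experiment: higher values favor classification accuracy, lower values favor explainability of the predicted rationales.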