The massive growth of social media usage has witnessed a tsunami of online toxicity in the form of hate speech, abusive posts, cyberbullying, etc. Detecting online toxicity is challenging due to its inherent subjectivity. Factors such as the context of the speech, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining whether the content can be flagged as toxic. Adopting automated toxicity detection models in production can lead to the sidelining of the very demographic and psychographic groups they aim to help in the first place. This has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, the literature is chaotic in its terminologies, techniques, and findings. In this paper, we present a systematic study of the limitations and challenges of existing methods. We start by developing a taxonomy for categorising various unintended biases and a suite of evaluation metrics proposed to quantify them. We then take a closer look at each proposed method for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study introducing the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of the critical challenges, research gaps, and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models.