Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases have so far focused on only a handful of axes of disparities and subgroups for which annotations or lexicons are available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geocultural contexts. Through a case study on a publicly available toxicity detection model, we demonstrate that our method identifies salient groups of cross-geographic errors, and, in a follow-up study, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also conduct an analysis of a model trained on a dataset with ground-truth labels to better understand these biases, and present preliminary mitigation experiments.