Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. These biases make the learned models unfair and can even exacerbate the marginalization of people. Considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) with toxicity labels. We empirically show that our method yields lower false positive rates on both lexical and dialectal attributes than previous debiasing methods.
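The sketch below illustrates the InvRat game structure referenced above: a generator produces a token-level rationale mask, an environment-agnostic predictor classifies toxicity from the masked text alone, and an environment-aware predictor additionally sees the protected attribute (here treated as the "environment", e.g., dialect). The generator is penalized whenever the environment helps prediction, discouraging rationales that rely on spurious attribute cues. This is a minimal, hedged PyTorch sketch, not the authors' implementation; all module names (`Generator`, `Predictor`, `invrat_losses`), sizes, and the relaxation used for the mask are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the InvRat three-player game (Chang et al., 2020),
# adapted to toxicity labels with a protected attribute as the environment.
# Module sizes and names are assumptions, not the paper's exact architecture.

class Generator(nn.Module):
    """Produces a soft token mask (the rationale) for each input sentence."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, tokens, temperature=0.5):
        h, _ = self.rnn(self.emb(tokens))
        logits = self.out(h).squeeze(-1)                      # (B, T) token scores
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)     # binary-concrete noise
        return torch.sigmoid((logits + u.log() - (1 - u).log()) / temperature)

class Predictor(nn.Module):
    """Classifies toxicity from the masked text; optionally sees the environment."""
    def __init__(self, vocab, n_env=0, dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.env_emb = nn.Embedding(n_env, dim) if n_env else None
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, tokens, mask, env=None):
        x = self.emb(tokens) * mask.unsqueeze(-1)             # keep only the rationale
        if self.env_emb is not None:
            x = x + self.env_emb(env).unsqueeze(1)            # inject environment info
        _, h = self.rnn(x)
        return self.cls(h.squeeze(0))

def invrat_losses(gen, pred_i, pred_e, tokens, labels, env, lam=1.0):
    """One forward pass of the game; returns generator and predictor losses."""
    mask = gen(tokens)
    loss_i = F.cross_entropy(pred_i(tokens, mask), labels)        # env-agnostic
    loss_e = F.cross_entropy(pred_e(tokens, mask, env), labels)   # env-aware
    # Penalize the generator whenever knowing the environment helps, so the
    # retained rationale carries only attribute-invariant evidence of toxicity.
    gen_loss = loss_i + lam * F.relu(loss_i - loss_e)
    return gen_loss, loss_i, loss_e
```

In training, the two predictors each minimize their own cross-entropy while the generator minimizes `gen_loss`; a sparsity/continuity regularizer on the mask (omitted here) is typically added to keep rationales short and contiguous.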