Recent research has demonstrated that racial biases against users who write African American English exist in popular toxic language datasets. While previous work has focused on a single fairness criterion, we propose using additional descriptive fairness metrics to better understand the source of these biases. We demonstrate that different benchmark classifiers, as well as two in-process bias-remediation techniques, propagate racial biases even in a larger corpus. We then propose a novel ensemble framework that uses a specialized classifier fine-tuned to the African American English dialect. We show that our proposed framework substantially reduces the racial biases that the model learns from these datasets. We demonstrate that the ensemble framework improves fairness metrics across all sample datasets with minimal impact on classification performance, and provide empirical evidence of its ability to unlearn annotation biases against authors who use African American English. ** Please note that this work may contain examples of offensive words and phrases.
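The abstract describes the ensemble framework only at a high level. Purely as an illustration, the sketch below shows one way a general toxicity classifier could be combined with a dialect-specialized classifier by weighting their scores with an estimated probability that the text is African American English. The names (`general_clf`, `aae_clf`, `dialect_estimator`) and the convex-combination rule are assumptions made for this sketch, not the combination strategy used in the paper.

```python
# Minimal sketch of an ensemble that defers to a dialect-specialized classifier.
# The weighting scheme here is an illustrative assumption, not the paper's method.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EnsembleToxicityClassifier:
    general_clf: Callable[[str], float]        # P(toxic | text) from a general-domain model
    aae_clf: Callable[[str], float]            # P(toxic | text) from an AAE-fine-tuned model
    dialect_estimator: Callable[[str], float]  # estimated P(text is AAE)

    def predict_proba(self, text: str) -> float:
        # Weight the specialized classifier more heavily as the estimated
        # probability that the text is written in AAE increases.
        p_aae = self.dialect_estimator(text)
        return p_aae * self.aae_clf(text) + (1.0 - p_aae) * self.general_clf(text)

    def predict(self, text: str, threshold: float = 0.5) -> bool:
        return self.predict_proba(text) >= threshold
```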