Bias mitigation approaches reduce a model's dependence on sensitive features of the data, such as social group tokens (SGTs), producing equal predictions across the sensitive features. In hate speech detection, however, equalizing model predictions may overlook important differences among targeted social groups, since hate speech often contains stereotypical language specific to each SGT. Here, to take the language specific to each SGT into account, we rely on counterfactual fairness and equalize predictions among counterfactuals generated by substituting the SGTs. Our method evaluates the similarity in sentence likelihoods (via pre-trained language models) among counterfactuals, so that SGTs are treated equally only within interchangeable contexts. By applying logit pairing to equalize outcomes on this restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model performance on hate speech detection.
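To make the two steps concrete, the sketch below illustrates one way the pipeline could look: counterfactuals are generated by swapping the SGT, kept only if their likelihood under a pre-trained language model stays close to the original sentence's, and then paired with the original via a logit-pairing penalty. This is a minimal illustration, not the paper's implementation: the use of GPT-2 as the likelihood model, the SGT list, the threshold `tau`, and the helper names (`sentence_log_likelihood`, `interchangeable_counterfactuals`, `clp_loss`) are all assumptions, and the squared-difference penalty is just one common form of logit pairing.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative SGT list; the actual set of social group tokens is task-specific.
SGTS = ["women", "men", "muslims", "jews", "immigrants"]

lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the pre-trained LM."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)   # cross-entropy over the sequence
    return -out.loss.item()     # higher = more likely

def interchangeable_counterfactuals(text: str, sgt: str, tau: float = 0.5):
    """Keep only SGT substitutions whose LM likelihood stays within `tau`
    of the original sentence, i.e. contexts where the SGTs are interchangeable."""
    base = sentence_log_likelihood(text)
    kept = []
    for alt in SGTS:
        if alt == sgt:
            continue
        cf = text.replace(sgt, alt)
        if abs(sentence_log_likelihood(cf) - base) <= tau:
            kept.append(cf)
    return kept

def clp_loss(logits_orig: torch.Tensor, logits_cfs: torch.Tensor) -> torch.Tensor:
    """Counterfactual logit pairing: penalize the gap between the classifier's
    logits on the original instance (shape [C]) and on its accepted
    counterfactuals (shape [K, C]); added to the task loss during training."""
    return (logits_cfs - logits_orig.unsqueeze(0)).pow(2).mean()
```

In this reading, the likelihood filter restricts the pairing term to counterfactuals the language model deems plausible, which is how SGT-specific language is preserved rather than averaged away.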