Recent work has shown that pre-trained language models capture social biases from the text corpora they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed debiasing techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three different bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) CDA and Self-Debias are the strongest debiasing techniques, obtaining improved scores on most of the bias benchmarks; (2) current debiasing techniques do not generalize well beyond gender bias; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs achieved through debiasing strategies are usually accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.
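To make the strongest-performing technique concrete, the following is a minimal sketch of two-sided Counterfactual Data Augmentation for gender bias: each training sentence is paired with a counterfactual copy in which gendered terms are swapped, and the model is trained on the union. The word-pair list and helper names here are illustrative assumptions, not taken from the paper; published CDA implementations use larger curated bidirectional word lists and handle ambiguous forms (e.g., "her" as both pronoun and determiner) more carefully.

```python
# Illustrative sketch of two-sided Counterfactual Data Augmentation (CDA).
# GENDER_PAIRS is a toy subset; real CDA uses curated bidirectional lists.
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",   # simplified: conflates pronoun/determiner uses
    "her": "his",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

def counterfactual(sentence: str) -> str:
    """Swap each gendered token for its counterpart, preserving capitalization."""
    out = []
    for token in sentence.split():
        core = token.strip(".,!?;:")      # detach trailing punctuation
        tail = token[len(core):]
        swapped = GENDER_PAIRS.get(core.lower(), core)
        if core and core[0].isupper():
            swapped = swapped.capitalize()
        out.append(swapped + tail)
    return " ".join(out)

def augment(corpus: list[str]) -> list[str]:
    """Two-sided CDA: keep every original sentence and add its counterfactual."""
    return [s for sent in corpus for s in (sent, counterfactual(sent))]

if __name__ == "__main__":
    print(augment(["He is a doctor and she is his nurse."]))
    # ['He is a doctor and she is his nurse.',
    #  'She is a doctor and he is her nurse.']
```

The design intuition is that training on both a sentence and its gender-swapped counterfactual discourages the model from associating professions, attributes, or roles with one gender, without removing those sentences from the corpus.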