Previous work has examined how debiasing language models affects downstream tasks, specifically, how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks. However, it remains poorly understood why debiasing methods have varying impacts on downstream tasks and how debiasing techniques affect the internal components of language models, i.e., neurons, layers, and attention. In this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain model components, e.g., the first two layers of language models and the attention heads.