Gender bias in language models has attracted considerable attention because it undermines social justice. However, most current debiasing methods degrade the model's performance on other tasks, and the mechanism behind this degradation remains unclear. We propose a theoretical framework that explains three candidate mechanisms of gender bias in language models. Using this framework, we explain why current debiasing methods cause performance degradation, and we identify a pathway through which debiasing does not degrade model performance. We further develop a causality-detection fine-tuning approach to correct gender bias. Numerical experiments demonstrate that our method achieves a double dividend: it partially mitigates gender bias while avoiding performance degradation.