To mitigate gender bias in contextualized language models, various intrinsic mitigation strategies have been proposed, alongside many bias metrics. Since these language models are ultimately used for downstream tasks such as text classification, it is important to understand whether, and to what extent, these intrinsic bias mitigation strategies translate into fairness on downstream tasks. In this work, we design a probe to investigate the effects that some of the major intrinsic gender bias mitigation strategies have on downstream text classification tasks. We discover that, instead of resolving gender bias, intrinsic mitigation techniques and metrics tend to hide it, in such a way that significant gender information is retained in the embeddings. Furthermore, we show that each mitigation technique is able to hide the bias from some of the intrinsic bias measures but not all, and each intrinsic bias measure can be fooled by some of the mitigation techniques, but not all. We confirm experimentally that none of the intrinsic mitigation techniques, when used without any other fairness intervention, consistently affects extrinsic bias. We therefore recommend that intrinsic bias mitigation techniques be combined with other fairness interventions for downstream tasks.
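As a minimal sketch of the probing idea described above, the snippet below trains a simple classifier to recover gender information from contextualized embeddings; if the probe performs well above chance on a held-out split, significant gender information remains in the representations. The checkpoint name, the toy template sentences, and the choice of logistic regression as the probe are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a gender-information probe over contextualized embeddings.
# Assumptions (not from the paper): the "bert-base-uncased" checkpoint,
# a toy template-based dataset, and logistic regression as the probe.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-uncased"  # swap in a "debiased" checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Toy probe data: sentences containing gendered terms, labeled by gender.
sentences = [
    ("he is a doctor", 0), ("she is a doctor", 1),
    ("he is a nurse", 0), ("she is a nurse", 1),
    ("he works as an engineer", 0), ("she works as an engineer", 1),
    ("he works as a teacher", 0), ("she works as a teacher", 1),
]

def embed(text):
    """Mean-pool the last hidden layer as a fixed-size sentence embedding."""
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0).numpy()

X = [embed(s) for s, _ in sentences]
y = [label for _, label in sentences]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# High held-out accuracy means gender is still linearly recoverable from
# the embeddings, even if intrinsic bias metrics report low bias.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```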