Detecting and mitigating harmful biases in modern language models are widely recognized as crucial, open problems. In this paper, we take a step back and investigate how language models come to be biased in the first place. We use a relatively small language model with an LSTM architecture, trained on an English Wikipedia corpus. With full access to the data and to the model parameters as they change at every step of training, we can map in detail how the representation of gender develops, which patterns in the dataset drive this development, and how the model's internal state relates to the bias in a downstream task (semantic textual similarity). We find that the representation of gender is dynamic and identify different phases during training. Furthermore, we show that gender information is represented increasingly locally in the input embeddings of the model and that, as a consequence, debiasing these embeddings can be effective in reducing the downstream bias. Monitoring the training dynamics allows us to detect an asymmetry in how the female and male genders are represented in the input embeddings. This is important, as it may cause naive mitigation strategies to introduce new undesirable biases. We discuss the relevance of these findings for mitigation strategies more generally, and the prospects of generalizing our methods to larger language models, the Transformer architecture, other languages, and other undesirable biases.