The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English text: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and a specific gender. Using probing, we disentangle the model's embeddings and identify the components that encode each type of information. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreference.
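As a rough illustration of filtering out a bias component while keeping the factual one, the sketch below projects toy embeddings onto the orthogonal complement of a bias direction. It is not the paper's actual method: the function `remove_direction`, the 3-dimensional vectors, and the assumption that probing yields two mutually orthogonal directions (`factual_dir`, `bias_dir`) are all hypothetical and chosen only to make the idea concrete.

```python
import numpy as np


def remove_direction(embeddings, direction):
    """Project each row of `embeddings` onto the orthogonal complement of `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Subtract each embedding's component along the (normalized) direction.
    return embeddings - np.outer(embeddings @ unit, unit)


# Hypothetical probe directions, assumed orthogonal for this toy example.
factual_dir = np.array([1.0, 0.0, 0.0])  # assumed factual-gender axis
bias_dir = np.array([0.0, 1.0, 0.0])     # assumed stereotypical-bias axis

# Toy embeddings (made-up values, not real model activations).
embeddings = np.array([
    [0.0, 0.8, 0.5],  # stands in for a gender-neutral profession name
    [1.0, 0.1, 0.3],  # stands in for a factually gendered word
])

filtered = remove_direction(embeddings, bias_dir)
```

Because the two directions are assumed orthogonal here, removing the bias component leaves the projection onto `factual_dir` unchanged. With real probes the directions would generally need to be constrained or orthogonalized for that to hold.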