Language models have demonstrated strong performance on various natural language understanding tasks. Like humans, however, language models can also acquire their own biases from the training data. As more and more downstream tasks integrate language models into their pipelines, it becomes necessary to understand these internal stereotypical representations and the methods to mitigate their negative effects. In this paper, we propose a simple method for probing the internal stereotypical representations of pre-trained language models using counterexamples. We focus mainly on gender bias, but the method can be extended to other types of bias. We evaluate models on nine different cloze-style prompts consisting of knowledge prompts and base prompts. Our results indicate that pre-trained language models show a certain amount of robustness to unrelated knowledge and prefer shallow linguistic cues, such as word position and syntactic structure, when altering their internal stereotypical representations. These findings shed light on how to manipulate language models toward neutrality in both fine-tuning and evaluation.
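As a rough illustration of the kind of cloze-style probing described above (a minimal sketch, not the paper's actual prompts, models, or scoring procedure), the snippet below compares the probabilities a masked language model assigns to gendered pronouns in a base prompt versus the same prompt prefixed with an unrelated knowledge sentence. The model choice, prompts, pronoun targets, and helper function are all illustrative assumptions.

```python
from transformers import pipeline

# A minimal sketch using a BERT-style masked LM; the prompts and model
# choice are hypothetical and not taken from the paper.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A base prompt and a variant prefixed with unrelated "knowledge".
base_prompt = "The nurse said that [MASK] is tired."
knowledge_prompt = "The hospital was built in 1952. The nurse said that [MASK] is tired."

def pronoun_scores(prompt):
    # Probability the model assigns to each gendered pronoun at the masked slot.
    predictions = fill_mask(prompt, targets=["he", "she"])
    return {p["token_str"]: p["score"] for p in predictions}

print("base:     ", pronoun_scores(base_prompt))
print("knowledge:", pronoun_scores(knowledge_prompt))
```

Comparing the two score dictionaries gives a simple, prompt-level view of whether added context shifts the model's gendered predictions.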