Language models have demonstrated strong performance on a wide range of natural language understanding tasks. Like humans, however, language models can acquire biases of their own from their training data. As more and more downstream tasks integrate language models into their pipelines, it becomes necessary to understand these internal stereotypical representations and the methods available to mitigate their negative effects. In this paper, we propose a simple method for probing the internal stereotypical representations of pre-trained language models using counterexamples. We focus mainly on gender bias, but the method can be extended to other types of bias. We evaluate models on nine different cloze-style prompts consisting of knowledge prompts and base prompts. Our results indicate that pre-trained language models show a certain degree of robustness to unrelated knowledge, and that shallow linguistic cues, such as word position and syntactic structure, are more effective at altering their internal stereotypical representations. These findings shed light on how to manipulate language models in a neutral manner for both fine-tuning and evaluation.
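To make the probing setup concrete, the following is a minimal sketch of how a cloze-style gender-bias probe of a pre-trained masked language model could look. The model name (`bert-base-uncased`), the prompt wordings, and the candidate pronouns are illustrative assumptions, not the exact prompts or models used in this paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical setup: bert-base-uncased stands in for whichever
# pre-trained masked language model is being probed.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()


def gendered_fill_probs(prompt: str, candidates=("he", "she")):
    """Return the masked-token probability of each candidate pronoun
    for a cloze-style prompt containing a single [MASK] token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1)
    return {w: probs[0, tokenizer.convert_tokens_to_ids(w)].item() for w in candidates}


# Base prompt vs. a counterexample-augmented prompt (wording is illustrative only):
# comparing the two probability gaps indicates how much the added knowledge
# shifts the model's stereotypical preference.
base = "The nurse said that [MASK] would be back soon."
counter = "My father is a nurse. The nurse said that [MASK] would be back soon."
print(gendered_fill_probs(base))
print(gendered_fill_probs(counter))
```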