Pre-trained language models (PLMs) have outperformed other NLP models on a wide range of tasks. Seeking a more thorough understanding of their capabilities and inner workings, researchers have established the extent to which they capture lower-level linguistic knowledge, such as grammaticality, and mid-level semantic knowledge, such as factual understanding. However, little is known about their knowledge of higher-level aspects of language. In particular, despite the importance of sociodemographic aspects in shaping our language, the questions of whether, where, and how PLMs encode these aspects, e.g., gender or age, remain unexplored. We address this research gap by probing the sociodemographic knowledge of different single-GPU PLMs on multiple English data sets via traditional classifier probing and information-theoretic minimum description length probing. Our results show that PLMs do encode these sociodemographics, and that this knowledge is sometimes spread across the layers of some of the tested PLMs. We further conduct a multilingual analysis and investigate the effect of supplementary training to further explore to what extent, where, and with what amount of pre-training data this knowledge is encoded. Our overall results indicate that sociodemographic knowledge remains a major challenge for NLP: PLMs require large amounts of pre-training data to acquire it, and models that excel at general language understanding do not seem to possess more knowledge about these aspects.
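To make the traditional classifier-probing setup mentioned above concrete, the following minimal sketch freezes a PLM, extracts a fixed-layer sentence representation, and fits a linear probe that predicts a sociodemographic label. The model name (roberta-base), the probed layer, and the toy data are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a layer-wise classifier probe (illustrative, not the paper's exact setup):
# freeze a PLM, take the first-token vector of one hidden layer, and fit a logistic-regression probe.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder texts and binarized sociodemographic labels (e.g., age group); real probing
# would use the annotated data sets referred to in the paper.
texts = ["example utterance one", "example utterance two",
         "example utterance three", "example utterance four"]
labels = [0, 1, 0, 1]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def layer_embedding(text, layer):
    """Return the first-token (<s>/[CLS]) vector of the given hidden layer for one text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    return out.hidden_states[layer][0, 0].numpy()

layer = 8  # probe one intermediate layer; repeating over all layers shows where knowledge is encoded
X = np.stack([layer_embedding(t, layer) for t in texts])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0, stratify=labels)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"layer {layer} probing accuracy: {probe.score(X_te, y_te):.2f}")
```

Running the same probe for every layer gives a per-layer accuracy profile; minimum description length probing additionally measures how compactly the labels can be encoded given the representations, rather than accuracy alone.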