Knowledge-enriched language representation learning has shown promising performance across various knowledge-intensive NLP tasks. However, existing knowledge-based language models are all trained with monolingual knowledge graph data, which limits their application to more languages. In this work, we present a novel framework for pretraining knowledge-based multilingual language models (KMLMs). We first generate large amounts of code-switched synthetic sentences and reasoning-based multilingual training data from the Wikidata knowledge graphs. Then, based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks that facilitate knowledge learning, allowing the language models not only to memorize factual knowledge but also to learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual NLP tasks, including named entity recognition, factual knowledge retrieval, relation classification, and a new task designed by us, namely, logic reasoning. Our code and pretrained language models will be made publicly available.
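To make the data-generation idea concrete, the following is a minimal sketch, not the released implementation, of how code-switched synthetic sentences could be built from Wikidata triples: each entity and relation surface form is sampled from a different language's labels so that a single sentence mixes languages. The `Triple` class, the toy `multilingual_labels` store, the relation verbalizations, and `make_code_switched_sentence` are all hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Triple:
    subject_id: str   # Wikidata QID, e.g. "Q90" (Paris)
    relation_id: str  # Wikidata PID, e.g. "P17" (country)
    object_id: str    # Wikidata QID, e.g. "Q142" (France)


# Toy label store for illustration only. Entity labels mimic Wikidata's
# multilingual labels; the relation entry is a simplified verbalization
# template rather than the actual Wikidata property label.
multilingual_labels = {
    "Q90":  {"en": "Paris", "de": "Paris", "zh": "巴黎"},
    "P17":  {"en": "is located in the country",
             "de": "liegt im Land",
             "zh": "位于国家"},
    "Q142": {"en": "France", "de": "Frankreich", "zh": "法国"},
}


def make_code_switched_sentence(triple: Triple,
                                languages=("en", "de", "zh")) -> str:
    """Verbalize a triple, sampling a (possibly different) language
    for the subject, relation, and object surface forms."""
    parts = []
    for node_id in (triple.subject_id, triple.relation_id, triple.object_id):
        lang = random.choice(languages)
        parts.append(multilingual_labels[node_id][lang])
    return " ".join(parts) + "."


if __name__ == "__main__":
    print(make_code_switched_sentence(Triple("Q90", "P17", "Q142")))
    # e.g. "巴黎 liegt im Land France."
```

In this reading, the masked-prediction pretraining objective over such mixed-language verbalizations would push the model to align entity knowledge across languages; the exact objectives and data construction details are those described in the paper, not this sketch.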