Knowledge-enhanced language representation learning has shown promising results across various knowledge-intensive NLP tasks. However, prior methods are limited in their ability to efficiently utilize multilingual knowledge graph (KG) data for language model (LM) pretraining. They often train LMs with KGs in indirect ways, relying on extra entity/relation embeddings to facilitate knowledge injection. In this work, we explore methods to make better use of the multilingual annotation and language-agnostic properties of KG triples, and present novel knowledge-based multilingual language models (KMLMs) trained directly on the knowledge triples. We first generate a large volume of multilingual synthetic sentences from Wikidata KG triples. Then, based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks that enable the LMs not only to memorize factual knowledge but also to learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual tasks, including named entity recognition (NER), factual knowledge retrieval, relation classification, and a newly designed logical reasoning task.
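To make the data-generation step concrete, the sketch below illustrates one plausible way to linearize Wikidata-style triples into multilingual synthetic sentences. It is a minimal sketch under stated assumptions, not the paper's exact procedure: the `LABELS` table, the `make_sentence` template, and the language set are illustrative placeholders.

```python
# Minimal sketch: turning Wikidata-style triples into multilingual synthetic
# sentences for LM pretraining. The label table and the sentence template are
# illustrative assumptions, not the paper's exact generation procedure.

from typing import Dict, List, Tuple

# A triple is (head entity ID, relation ID, tail entity ID).
Triple = Tuple[str, str, str]

# Hypothetical multilingual labels keyed by Wikidata ID, then language code.
LABELS: Dict[str, Dict[str, str]] = {
    "Q90":  {"en": "Paris", "de": "Paris", "zh": "巴黎"},
    "P17":  {"en": "country", "de": "Land", "zh": "国家"},
    "Q142": {"en": "France", "de": "Frankreich", "zh": "法国"},
}

def make_sentence(head: str, relation: str, tail: str) -> str:
    """Linearize one labeled triple into a simple synthetic sentence."""
    return f"{head} {relation} {tail} ."

def generate(triples: List[Triple], languages: List[str]) -> List[str]:
    """Produce one synthetic sentence per triple per language."""
    sentences: List[str] = []
    for h, r, t in triples:
        for lang in languages:
            sentences.append(
                make_sentence(LABELS[h][lang], LABELS[r][lang], LABELS[t][lang])
            )
    return sentences

if __name__ == "__main__":
    # e.g. (Paris, country, France) rendered in English, German, and Chinese.
    for s in generate([("Q90", "P17", "Q142")], ["en", "de", "zh"]):
        print(s)
```

Because the same triple is rendered in every language, the resulting sentences are parallel in meaning, which is what allows the pretraining tasks to exploit the language-agnostic structure of the KG.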