Pre-trained language representation models (PLMs) cannot capture factual knowledge from text well. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performance on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate research on large KGs, inductive KE, and KGs with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.
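To make the joint training idea concrete, the sketch below illustrates one way the two objectives can be combined: entity embeddings are taken from the PLM's representation of the first token of each entity description, scored with a TransE-style function against learned relation embeddings, and the resulting KE loss is summed with the masked language modeling loss. This is a minimal illustration assuming a RoBERTa-style encoder and a simplified negative-sampling loss; the class and variable names are hypothetical and not taken from the official codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM


class KeplerSketch(nn.Module):
    """Illustrative joint KE + MLM objective (not the official implementation)."""

    def __init__(self, model_name="roberta-base", num_relations=1000, gamma=4.0):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # Relation embeddings are learned directly; entity embeddings come from text.
        self.rel_emb = nn.Embedding(num_relations, hidden)
        self.gamma = gamma  # margin, value here is only a placeholder

    def encode_entity(self, input_ids, attention_mask):
        # Entity embedding = hidden state of the first token of its description.
        out = self.lm.base_model(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]

    def ke_loss(self, head, rel_ids, tail, neg_tail):
        # TransE-style score -||h + r - t||_1 with a simple negative-sampling loss.
        r = self.rel_emb(rel_ids)
        pos = self.gamma - torch.norm(head + r - tail, p=1, dim=-1)
        neg = self.gamma - torch.norm(
            head.unsqueeze(1) + r.unsqueeze(1) - neg_tail, p=1, dim=-1
        )
        return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())

    def forward(self, ke_batch, mlm_batch):
        # KE branch: encode head, tail, and negative-tail descriptions with the PLM.
        head = self.encode_entity(*ke_batch["head"])
        tail = self.encode_entity(*ke_batch["tail"])
        neg_tail = self.encode_entity(*ke_batch["neg_tail"]).view(
            head.size(0), -1, head.size(-1)
        )
        loss_ke = self.ke_loss(head, ke_batch["rel_ids"], tail, neg_tail)
        # MLM branch: standard masked language modeling on ordinary text.
        loss_mlm = self.lm(**mlm_batch).loss
        return loss_ke + loss_mlm  # joint objective optimized during pre-training
```

Because the entity embeddings are computed from descriptions rather than stored in a lookup table, a model trained this way can embed entities unseen during training, which is what enables the inductive link-prediction setting mentioned above.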