Prix-LM: Pretraining for Multilingual Knowledge Base Construction

Knowledge bases (KBs) contain plenty of structured world and commonsense knowledge. As such, they often complement distributional text-based information and facilitate various downstream tasks. Since their manual construction is resource- and time-intensive, recent efforts have tried leveraging large pretrained language models (PLMs) to generate additional monolingual knowledge facts for KBs. However, such methods have not been attempted for building and enriching multilingual KBs. Besides wider application, such multilingual KBs can provide richer combined knowledge than monolingual (e.g., English) KBs. Knowledge expressed in different languages may be complementary and unequally distributed: this implies that the knowledge available in high-resource languages can be transferred to low-resource ones. To achieve this, it is crucial to represent multilingual knowledge in a shared/unified space. To this end, we propose a unified framework, Prix-LM, for multilingual KB construction and completion. We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs, and tune a multilingual language encoder XLM-R via a causal language modeling objective. Prix-LM integrates useful multilingual and KB-based factual knowledge into a single model. Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness, with gains reported over strong task-specialised baselines.
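
As a rough illustration of how the two kinds of knowledge mentioned above (monolingual triples and cross-lingual links) could be turned into training sequences for a causal language modeling objective, the sketch below serializes each fact into a flat token sequence and derives next-token prediction pairs from it. The marker tokens `[S]`, `[P]`, `[O]`, `[EOS]`, the whitespace tokenization, the choice to serialize a link without a predicate, and all function names are assumptions made for this example; they are not taken from the abstract and should not be read as the authors' exact format.

```python
from typing import List, Tuple

# Hypothetical marker tokens (an assumption for this sketch).
S, P, O, EOS = "[S]", "[P]", "[O]", "[EOS]"


def linearize_triple(subj: str, pred: str, obj: str) -> List[str]:
    """Serialize a monolingual (subject, predicate, object) triple."""
    return [S, *subj.split(), P, *pred.split(), O, *obj.split(), EOS]


def linearize_link(entity_a: str, entity_b: str) -> List[str]:
    """Serialize a cross-lingual link between two names of one entity.

    Assumption: the link is written like a triple with no predicate.
    """
    return [S, *entity_a.split(), O, *entity_b.split(), EOS]


def causal_lm_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Build (prefix, next_token) pairs: each token is predicted
    from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


triple = linearize_triple("Paris", "capital of", "France")
link = linearize_link("Germany", "Deutschland")
pairs = causal_lm_pairs(triple)

for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
```

In a real setup the whitespace split would be replaced by the subword tokenizer of the multilingual encoder, and the pairs would be realized implicitly through a causal attention mask rather than enumerated one by one.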
