Knowledge-intensive NLP tasks can benefit from linking natural language text with facts from a Knowledge Graph (KG). Although facts themselves are language-agnostic, the fact labels (i.e., the language-specific representations of a fact) in the KG are often available in only a few languages. This makes it challenging to link KG facts to sentences written in languages outside this limited set. To address this problem, we introduce the task of Multilingual Fact Linking (MFL), where the goal is to link a fact expressed in a sentence to the corresponding fact in the KG, even when the fact label in the KG is not available in the language of the sentence. To facilitate research in this area, we present a new evaluation dataset, IndicLink. This dataset contains 11,293 linked WikiData facts and 6,429 sentences spanning English and six Indian languages. We propose a Retrieval+Generation model, ReFCoG, that can scale to millions of KG facts by combining Dual Encoder-based retrieval with a Seq2Seq-based generation model that is constrained to output only valid KG facts. ReFCoG outperforms standard Retrieval+Re-ranking models by 10.7 pts in Precision@1. In spite of this gain, the model achieves an overall score of 52.1, showing ample scope for improvement in the task. ReFCoG code and IndicLink data are available at https://github.com/SaiKeshav/mfl
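The sketch below illustrates the Retrieval+Generation pattern described above: a dual encoder scores verbalised KG fact labels against an input sentence, and a Seq2Seq model then decodes under a prefix-trie constraint so that only valid fact strings can be produced. This is a minimal illustration assuming Hugging Face Transformers and PyTorch; the model names, the fact-label format, and the trie helper are illustrative assumptions, not the exact ReFCoG implementation.

```python
# Minimal sketch of a Retrieval+Generation fact-linking pipeline (assumptions:
# mBERT as the dual encoder, mT5 as the generator, toy fact labels).
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM

# --- Stage 1: Dual Encoder retrieval over verbalised KG facts ---------------
enc_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    """Mean-pooled sentence embeddings (one common dual-encoder choice)."""
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)               # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)                # (B, H)

fact_labels = [                                    # toy verbalised WikiData facts
    "Taj Mahal | located in | Agra",
    "Agra | country | India",
]
sentence = "ताज महल आगरा में स्थित है"              # Hindi: "The Taj Mahal is in Agra"

scores = embed([sentence]) @ embed(fact_labels).T  # dot-product relevance
top_k = scores.topk(k=2).indices[0].tolist()
candidates = [fact_labels[i] for i in top_k]       # retrieved candidate facts

# --- Stage 2: Seq2Seq generation constrained to valid fact strings ----------
gen_tok = AutoTokenizer.from_pretrained("google/mt5-small")
generator = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Prefix trie over candidate-fact token ids: at every decoding step the model
# may only emit tokens that keep the output a prefix of some valid fact.
trie = {}
for fact in candidates:
    node = trie
    for tok in gen_tok(fact, add_special_tokens=False).input_ids + [gen_tok.eos_token_id]:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, generated):
    node = trie
    for tok in generated.tolist()[1:]:             # skip the decoder start token
        node = node.get(tok, {})
    return list(node.keys()) or [gen_tok.eos_token_id]

inputs = gen_tok("link facts: " + sentence, return_tensors="pt")
out = generator.generate(**inputs,
                         prefix_allowed_tokens_fn=allowed_tokens,
                         max_length=32)
print(gen_tok.decode(out[0], skip_special_tokens=True))  # a valid KG fact string
```

Because decoding is restricted to the trie, the generator can only ever emit one of the retrieved fact labels (here, untuned mT5 is used purely to show the constrained-decoding mechanics; the actual model would be fine-tuned on fact-linking data).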