Prior work on data-to-text generation, the task of converting knowledge graph (KG) triples into natural text, has focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG and discuss the unique challenges of broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs with natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus of a retrieval language model, showing significant improvements on the knowledge-intensive tasks of open-domain question answering and the LAMA knowledge probe.
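To make the task format concrete, the sketch below shows what verbalization takes in and produces: a set of KG triples about one entity is rendered as a natural-language sentence. This is a naive, hypothetical template baseline for illustration only, not the verbalization method used in the paper; the `Triple` type and `verbalize` function are invented names.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str   # entity label, e.g. "Marie Curie"
    relation: str  # property label, e.g. "field of work"
    obj: str       # object label, e.g. "physics"


def verbalize(triples: List[Triple]) -> str:
    """Render triples sharing one subject as a single English sentence.

    A crude template stand-in for the learned verbalizer the task calls for.
    """
    if not triples:
        return ""
    subject = triples[0].subject
    clauses = [f"{t.relation} is {t.obj}" for t in triples]
    return f"{subject}'s " + "; ".join(clauses) + "."


# Example: two Wikidata-style triples about one entity.
triples = [
    Triple("Marie Curie", "field of work", "physics"),
    Triple("Marie Curie", "country of citizenship", "Poland"),
]
print(verbalize(triples))
# -> Marie Curie's field of work is physics; country of citizenship is Poland.
```

Sentences produced this way can then be added directly to the retrieval corpus of a retrieval language model, which is how the paper evaluates the approach.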