We present a new dataset of Wikipedia articles, each paired with a knowledge graph, to facilitate research in conditional text generation, graph generation, and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (one or a few sentences), limiting the capabilities of the models that can be learned from the data. Our new dataset, WikiGraphs, is collected by pairing each Wikipedia article from the established WikiText-103 benchmark (Merity et al., 2016) with a subgraph from the Freebase knowledge graph (Bollacker et al., 2008). This makes it easy to benchmark against other state-of-the-art text generative models that are capable of generating long paragraphs of coherent text. Both the graphs and the text data are of significantly larger scale compared to prior graph-text paired datasets. We present baseline graph neural network and transformer model results on our dataset for three tasks: graph -> text generation, graph -> text retrieval, and text -> graph retrieval. We show that better conditioning on the graph provides gains in generation and retrieval quality, but there is still large room for improvement.