Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages. However, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective. Hence, in this work, we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system whose input is a set of citations and a section title and whose output is a section-specific LR summary. The proposed system is based on the novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model that generates the section-specific text. Extensive experiments show that multi-domain training outperforms the multi-lingual setup on average.
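To make the two-stage design concrete, the sketch below shows one plausible instantiation of the pipeline: an unsupervised extractive stage scores multilingual reference sentences against the section title and keeps the most salient ones, and an abstractive multilingual seq2seq model then generates the section text from them. The specific model choices (LaBSE sentence embeddings via `sentence-transformers`, `google/mt5-small` via HuggingFace `transformers`) and the prompt format are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical two-stage extract-then-generate pipeline in the spirit of XWikiGen.
# Model names and prompt format are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def extract_salient(sentences, section_title, k=10,
                    encoder_name="sentence-transformers/LaBSE"):
    """Stage 1 (unsupervised extractive): rank reference sentences by
    cosine similarity to the section title in a multilingual embedding
    space and keep the top-k."""
    encoder = SentenceTransformer(encoder_name)
    sent_emb = encoder.encode(sentences, convert_to_tensor=True)
    query_emb = encoder.encode(section_title, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]
    top_idx = scores.topk(min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in top_idx]


def generate_section(section_title, salient_sentences,
                     model_name="google/mt5-small"):
    """Stage 2 (neural abstractive): condition a multilingual seq2seq model
    on the section title plus the extracted sentences to produce the
    section-specific text."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    prompt = section_title + " </s> " + " ".join(salient_sentences)
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In use, the cited reference articles would first be split into sentences (in whatever languages they are written), passed through `extract_salient` with the target section title, and the surviving sentences fed to `generate_section` to produce the Wikipedia-style section in the target LR language.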