The lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for \emph{low-resource (LR) languages} a critical problem. Existing work on Wikipedia text generation has focused on \emph{English only}, where English reference articles are summarized to generate English Wikipedia pages. For low-resource languages, however, the scarcity of reference articles makes monolingual summarization ineffective. Hence, in this work, we propose \task{}, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, \data{}, spanning $\sim$69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system whose input is a set of citations and a section title and whose output is a section-specific LR summary. The proposed system is based on a novel idea: neural unsupervised extractive summarization coarsely identifies salient information, which a neural abstractive model then uses to generate the section-specific text. Extensive experiments show that multi-domain training outperforms the multilingual setup on average.
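To make the two-stage setup concrete, the sketch below illustrates one way such a pipeline can be wired together. It is not the paper's implementation: the multilingual sentence encoder, the mBART-50 checkpoint, the cosine-similarity salience heuristic, and the prompt format are all assumptions chosen purely for illustration.

```python
# Illustrative sketch of a two-stage cross-lingual section generator.
# Stage 1: unsupervised extractive step -- rank reference sentences by
#          embedding similarity to the section title, keep the top-k.
# Stage 2: abstractive step -- condition a multilingual seq2seq model on the
#          section title plus the extracted sentences to generate LR text.
# Models and heuristics below are placeholders, not the authors' choices.

from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def extract_salient(sentences, section_title, top_k=10):
    """Stage 1: coarse, unsupervised selection of salient reference sentences."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    sent_emb = encoder.encode(sentences, convert_to_tensor=True)
    title_emb = encoder.encode(section_title, convert_to_tensor=True)
    scores = util.cos_sim(title_emb, sent_emb)[0]
    top = scores.topk(min(top_k, len(sentences))).indices.tolist()
    return [sentences[i] for i in top]


def generate_section(salient_sentences, section_title, target_lang="hi_IN"):
    """Stage 2: abstractive generation of section-specific text in the target language."""
    name = "facebook/mbart-large-50-many-to-many-mmt"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    prompt = section_title + " </s> " + " ".join(salient_sentences)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[target_lang],
        num_beams=4,
        max_length=256,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In this sketch the extractive stage only filters the (possibly very long) multilingual reference text down to a budget the abstractive model can consume, which mirrors the coarse-then-generate division of labor described above.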