Writing a survey paper on a research topic usually requires covering the salient content of numerous related papers, a task that can be modeled as multi-document summarization (MDS). Existing MDS datasets usually focus on producing an unstructured summary that covers only a few input documents. Meanwhile, previous work on structured summary generation focuses on summarizing a single document into a multi-section summary. Neither these datasets nor these methods meet the requirements of summarizing numerous academic papers into a structured summary. To address this scarcity of data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and use the abstracts of their 430 thousand reference papers as input documents. To organize the diverse content from dozens of input documents and to process long text sequences efficiently, we propose a summarization method named category-based alignment and sparse transformer (CAST). Experimental results show that CAST outperforms various advanced summarization methods.