The summarization literature focuses largely on news articles; documents in the CNN-DailyMail dataset are relatively short, with about 30 sentences per document on average. We introduce SciBERTSUM, a summarization framework designed for long documents such as scientific papers with more than 500 sentences. SciBERTSUM extends BERTSUM to long documents by 1) adding a section embedding layer that incorporates section information into each sentence vector and 2) applying a sparse attention mechanism in which each sentence attends locally to nearby sentences and only a small number of sentences attend globally to all other sentences. We use slides generated by the authors of scientific papers as reference summaries, since they contain the technical details of the paper. The results show the superiority of our model in terms of ROUGE scores.
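To make the local-plus-global attention pattern concrete, the minimal sketch below builds a sparse sentence-level attention mask of the kind described above. It is not the authors' implementation: the window size, the choice of global sentences, and the helper name `sparse_attention_mask` are illustrative assumptions.

```python
import torch

def sparse_attention_mask(num_sents: int, window: int, global_idx: list[int]) -> torch.Tensor:
    """Boolean (num_sents x num_sents) mask; True means attention is allowed."""
    mask = torch.zeros(num_sents, num_sents, dtype=torch.bool)
    for i in range(num_sents):
        lo, hi = max(0, i - window), min(num_sents, i + window + 1)
        mask[i, lo:hi] = True      # local attention: each sentence sees nearby sentences
    mask[global_idx, :] = True     # global sentences attend to all sentences
    mask[:, global_idx] = True     # and all sentences attend to the global ones
    return mask

# e.g. a 500-sentence paper, a local window of 5 sentences on each side,
# and a few (hypothetically chosen) section-leading sentences as global tokens
mask = sparse_attention_mask(500, window=5, global_idx=[0, 120, 260, 410])
```

Such a mask keeps the per-sentence attention cost roughly linear in document length, which is what makes applying a BERTSUM-style extractive layer to papers with hundreds of sentences feasible.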