Our analysis of large summarization datasets indicates that redundancy is a serious problem when summarizing long documents. Yet, redundancy reduction has not been thoroughly investigated in neural summarization. In this work, we systematically explore and compare different ways to deal with redundancy when summarizing long documents. Specifically, we organize existing methods into categories based on when and how redundancy is considered. Then, in the context of these categories, we propose three additional methods that balance non-redundancy and importance in a general and flexible way. In a series of experiments, we show that our proposed methods achieve state-of-the-art ROUGE scores on two scientific paper datasets, PubMed and arXiv, while significantly reducing redundancy.