Extractive summarization systems are known to produce text with poor coherence and, unless redundancy is explicitly accounted for, high redundancy. In this work, we tackle summary redundancy in unsupervised extractive summarization of long, highly redundant documents. To this end, we leverage a psycholinguistic theory of human reading comprehension that directly models local coherence and redundancy. Implementing this theory, our system operates at the proposition level and exploits properties of human memory representations to assign similar ranks to content units that are coherent and non-redundant, thereby encouraging the extraction of less redundant final summaries. Because summary length affects automatic measures, we control for it by formulating content selection as an optimization problem with soft constraints on the budget of information retrieved. Using summarization of scientific articles as a case study, extensive experiments show that the proposed systems consistently extract less redundant summaries across increasing levels of document redundancy, while maintaining performance comparable to strong unsupervised baselines in terms of relevance and local coherence, according to automated evaluations.