This paper presents an unsupervised extractive approach to summarizing long scientific documents based on the Information Bottleneck principle. Inspired by previous work that uses the Information Bottleneck principle for sentence compression, we extend it to document-level summarization in two separate steps. In the first step, we use one or more signals as queries to retrieve the key content from the source document. In the second step, a pre-trained language model performs further sentence search and editing to return the final extracted summaries. Importantly, our approach can be flexibly extended to a multi-view framework by using different signals. Automatic evaluation on three scientific document datasets verifies the effectiveness of the proposed framework. Further human evaluation suggests that the extracted summaries cover more content aspects than those of previous systems.
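To make the first step concrete, the following is a minimal sketch of signal-based retrieval and its multi-view extension. It is an illustration under stated assumptions, not the method of the paper: bag-of-words cosine similarity stands in for the Information Bottleneck scoring, the second step with a pre-trained language model is omitted, and the names `tokenize`, `cosine`, `retrieve_indices`, `retrieve`, and `multi_view`, as well as the parameter `k`, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_indices(sentences, signal, k=2):
    """Indices of the k sentences most similar to the signal.

    Assumption: cosine similarity replaces the paper's actual scoring.
    Ties are broken in favor of the earlier sentence.
    """
    query = Counter(tokenize(signal))
    scored = [(cosine(Counter(tokenize(s)), query), i)
              for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return sorted(i for _, i in top)


def retrieve(sentences, signal, k=2):
    """Single-view retrieval: selected sentences in document order."""
    return [sentences[i] for i in retrieve_indices(sentences, signal, k)]


def multi_view(sentences, signals, k=2):
    """Multi-view retrieval: union of per-signal selections, in document order."""
    keep = set()
    for signal in signals:
        keep.update(retrieve_indices(sentences, signal, k))
    return [sentences[i] for i in sorted(keep)]
```

In this sketch a signal is any short text, such as a title or a list of keywords. Passing several signals to `multi_view` merges the per-signal selections, which mirrors the multi-view extension described above only at a schematic level.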