Extracting summaries from long documents can be cast as sentence classification that draws on the documents' structural information, but how to use that structure effectively remains challenging. In this paper, we propose GoSum, a novel graph and reinforcement learning based extractive model for long-paper summarization. In particular, GoSum encodes sentence states in reinforcement learning by building a heterogeneous graph for each input document at different discourse levels. Edges in the graph reflect the document's discourse hierarchy, which restrains semantic drift across section boundaries. We evaluate GoSum on two datasets for scientific article summarization: PubMed and arXiv. Experimental results show that GoSum achieves state-of-the-art results compared with strong extractive and abstractive baselines. Ablation studies further confirm that GoSum's performance benefits from the use of discourse information.
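To make the idea of a heterogeneous, discourse-aware graph concrete, the following is a minimal sketch and not GoSum's actual implementation. It assumes two node types (sections and sentences), membership edges from each sentence to its own section, and edges between every pair of sections. The function name `build_discourse_graph`, the node id scheme, and the edge labels are all hypothetical choices made for illustration.

```python
from itertools import combinations


def build_discourse_graph(sections):
    """Build a toy heterogeneous graph for one document.

    `sections` is a list of sections, each a list of sentence strings.
    Returns (nodes, edges): `nodes` maps a node id to its type, and
    `edges` is a set of (source_id, target_id, edge_type) triples.
    The graph layout is an assumption for illustration only.
    """
    nodes = {}
    edges = set()

    for s_idx, sentences in enumerate(sections):
        sec_id = f"sec{s_idx}"
        nodes[sec_id] = "section"
        for t_idx, _ in enumerate(sentences):
            sent_id = f"sec{s_idx}.sent{t_idx}"
            nodes[sent_id] = "sentence"
            # A sentence is linked only to the section that contains it.
            edges.add((sent_id, sec_id, "sentence-section"))

    # Sections are linked to each other, so information can move between
    # sections only through the section-level nodes.
    section_ids = [n for n, kind in nodes.items() if kind == "section"]
    for a, b in combinations(section_ids, 2):
        edges.add((a, b, "section-section"))

    return nodes, edges


doc = [
    ["Intro sentence 1.", "Intro sentence 2."],
    ["Method sentence 1."],
    ["Result sentence 1.", "Result sentence 2."],
]
nodes, edges = build_discourse_graph(doc)
```

In this sketch, sentences in different sections never share a direct edge. That is one simple way to reflect the discourse hierarchy the abstract describes, in which section boundaries limit how sentence-level information spreads.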