Document summarization provides an instrument for quickly understanding a collection of text documents and has several real-life applications. With the growth of online text data, numerous summarization models have been proposed in recent years. Sequence-to-Sequence (Seq2Seq) based neural summarization models are the most widely used in the field owing to their strong performance, since they adequately capture both the semantic and the structural information in the text during encoding. However, existing extractive summarization models pay little attention to, and make little use of, central topic information when generating summaries, so they cannot ensure that the generated summary stays on the primary topic. A lengthy document can span several topics, and a single summary cannot do justice to all of them. Therefore, the key to generating a high-quality summary, especially for a long document, is to determine the central topic and build the summary around it. To address this issue, we propose a topic-aware encoding model for document summarization. The model effectively combines syntactic-level and topic-level information to build a comprehensive sentence representation. Specifically, a neural topic model is incorporated into the neural sentence-level representation learning so that the central topic information is adequately considered when capturing the critical content of the original document. Experimental results on three public datasets show that our model outperforms state-of-the-art models.
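To make the fusion of topic-level and sentence-level information concrete, the following is a minimal PyTorch sketch of the general idea: a VAE-style neural topic model infers a document-topic vector from a bag-of-words input, and that vector is concatenated with contextual sentence encodings before extractive scoring. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only; the fusion scheme and hyperparameters are assumptions.
import torch
import torch.nn as nn

class NeuralTopicModel(nn.Module):
    """VAE-style topic model: bag-of-words -> document-topic vector theta."""
    def __init__(self, vocab_size, num_topics, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        theta = torch.softmax(z, dim=-1)  # document-topic distribution
        return theta

class TopicAwareExtractor(nn.Module):
    """Scores sentences from contextual encodings fused with the topic vector."""
    def __init__(self, emb_dim, num_topics, hidden=256):
        super().__init__()
        self.sent_enc = nn.GRU(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.scorer = nn.Linear(2 * hidden + num_topics, 1)

    def forward(self, sent_embs, theta):
        # sent_embs: (batch, num_sents, emb_dim); theta: (batch, num_topics)
        h, _ = self.sent_enc(sent_embs)                   # contextual sentence reps
        t = theta.unsqueeze(1).expand(-1, h.size(1), -1)  # broadcast topic vector
        scores = self.scorer(torch.cat([h, t], dim=-1))   # topic-aware scoring
        return scores.squeeze(-1)  # higher score -> select sentence for the summary
```

Under this sketch, sentences are ranked by their topic-aware scores and the top-scoring ones are extracted, which is one simple way a topic vector can bias selection toward the document's central topic.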