Abstractive document summarization is usually modeled as a sequence-to-sequence (Seq2Seq) learning problem. Unfortunately, training large Seq2Seq-based summarization models on limited supervised summarization data is challenging. This paper presents three pre-training objectives that allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text. The main idea is that, given an input text artificially constructed from a document, the model is pre-trained to reinstate the original document. These objectives, namely sentence reordering, next sentence generation, and masked document generation, are closely related to the abstractive document summarization task. Experiments on two benchmark summarization datasets (i.e., CNN/DailyMail and New York Times) show that all three objectives improve performance over the baselines. Compared to models pre-trained on large-scale data (more than 160GB), our method, which uses only 19GB of text for pre-training, achieves comparable results, demonstrating its effectiveness.
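A minimal sketch of how such artificially constructed inputs could be produced from an unlabeled document is given below; the helper names, the half-document split for next sentence generation, and the 15% masking ratio are illustrative assumptions, not the paper's exact settings.

```python
import random

MASK = "<mask>"  # placeholder token; the actual mask symbol is an assumption

def sentence_reordering(sentences):
    """Shuffle the document's sentences; the target is the original document."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled, sentences

def next_sentence_generation(sentences):
    """Feed the first part of the document; the target is the remaining part."""
    split = len(sentences) // 2  # assumed 50/50 split for illustration
    return sentences[:split], sentences[split:]

def masked_document_generation(sentences, mask_ratio=0.15):
    """Mask a fraction of sentences; the target is the full original document."""
    masked = [MASK if random.random() < mask_ratio else s for s in sentences]
    return masked, sentences
```

In each case the corrupted text serves as the encoder input and the original document (or its missing portion) as the decoder target, so the Seq2Seq model learns to reinstate the source text before being fine-tuned on supervised summarization data.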