Neural models trained with large amounts of parallel data have achieved impressive performance in abstractive summarization tasks. However, large-scale parallel corpora are expensive and challenging to construct. In this work, we introduce a low-cost and effective strategy, ExtraPhrase, to augment training data for abstractive summarization tasks. ExtraPhrase constructs pseudo training data in two steps: extractive summarization and paraphrasing. We extract the major parts of an input text in the extractive summarization step, and obtain diverse expressions of them in the paraphrasing step. Through experiments, we show that ExtraPhrase improves the performance of abstractive summarization tasks by more than 0.50 points in ROUGE scores compared to the setting without data augmentation. ExtraPhrase also outperforms existing methods such as back-translation and self-training. We further show that ExtraPhrase is particularly effective when the amount of genuine training data is remarkably small, i.e., in a low-resource setting. Moreover, ExtraPhrase is more cost-efficient than the existing approaches.
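To make the two-step pipeline concrete, the following is a minimal sketch of how pseudo training pairs could be assembled from unlabeled documents. The extractive scorer (a toy word-frequency heuristic) and the `paraphrase` callable are placeholders of our own, not the exact components used in ExtraPhrase; in practice the paraphrasing step could be realized with, e.g., a round-trip translation or a dedicated paraphrase model.

```python
# Hypothetical sketch of an ExtraPhrase-style pseudo-data pipeline:
# (1) an extractive step keeps the most salient sentences of a document,
# (2) a paraphraser rewrites that extract to diversify its surface form.
# Each (document, paraphrased extract) pair then serves as pseudo training data.

from collections import Counter
from typing import Callable, List, Tuple


def extract_salient_sentences(document: str, ratio: float = 0.3) -> str:
    """Toy extractive step: score sentences by word-frequency overlap with
    the whole document and keep the top `ratio` fraction, in original order."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    doc_counts = Counter(w.lower() for s in sentences for w in s.split())
    scored = [
        (sum(doc_counts[w.lower()] for w in s.split()) / max(len(s.split()), 1), i, s)
        for i, s in enumerate(sentences)
    ]
    k = max(1, int(len(sentences) * ratio))
    kept = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return ". ".join(s for _, _, s in kept) + "."


def build_pseudo_pairs(documents: List[str],
                       paraphrase: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pair each source document with a paraphrased extract as its pseudo summary."""
    pairs = []
    for doc in documents:
        extract = extract_salient_sentences(doc)
        pseudo_summary = paraphrase(extract)  # placeholder paraphrasing step
        pairs.append((doc, pseudo_summary))
    return pairs


if __name__ == "__main__":
    identity_paraphrase = lambda text: text  # stand-in; a real system rewrites the extract
    docs = [
        "Neural models need large parallel corpora. Such corpora are costly to build. "
        "Pseudo data can reduce that cost."
    ]
    for source, pseudo in build_pseudo_pairs(docs, identity_paraphrase):
        print("SOURCE:", source)
        print("PSEUDO:", pseudo)
```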