Neural abstractive summarization methods often require large quantities of labeled training data. However, labeling large amounts of summarization data is often prohibitive due to time, financial, and expertise constraints, which has limited the adoption of summarization systems in practical applications. In this paper, we argue that this limitation can be overcome with a semi-supervised approach: consistency training, which leverages large amounts of unlabeled data to improve the performance of supervised learning on a small corpus. Consistency regularization for semi-supervised learning encourages model predictions to be invariant to small amounts of noise applied to the input articles. By adding a noised unlabeled corpus to regularize training for consistency, this framework obtains comparable performance without using the full labeled dataset. In particular, we verify that leveraging large amounts of unlabeled data substantially improves the performance of supervised learning over an insufficient labeled dataset.
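The abstract describes the idea only at a high level; the minimal sketch below shows one way a combined objective of this kind can be written, assuming a sequence-to-sequence model. All names here (`model`, `add_noise`, `lambda_u`, the Hugging Face-style `generate` call) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def consistency_step(model, labeled_batch, unlabeled_src, add_noise, lambda_u=1.0):
    """One training step combining a supervised loss and a consistency loss.

    `model(src, tgt)` is assumed to return per-token logits of shape
    (batch, tgt_len, vocab); `add_noise` is a hypothetical input
    perturbation such as word dropout or synonym replacement.
    """
    # Supervised cross-entropy on the small labeled corpus.
    src, tgt = labeled_batch
    logits = model(src, tgt)
    sup_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt.reshape(-1))

    # Pseudo-summary for the unlabeled article (greedy decoding is one
    # option; a `generate` method is assumed, as in Hugging Face-style APIs).
    with torch.no_grad():
        pseudo_tgt = model.generate(unlabeled_src)
        clean_logits = model(unlabeled_src, pseudo_tgt)

    # Consistency term: predictions on the noised article should stay close
    # to predictions on the clean article for the same pseudo-summary.
    noised_logits = model(add_noise(unlabeled_src), pseudo_tgt)
    cons_loss = F.kl_div(F.log_softmax(noised_logits, dim=-1),
                         F.softmax(clean_logits, dim=-1),
                         reduction="batchmean")

    return sup_loss + lambda_u * cons_loss
```

The weight `lambda_u` trades off fitting the labeled examples against enforcing invariance on the unlabeled articles; the specific noise function and weighting schedule would follow the paper's experimental setup.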