Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to extrapolate oracle extracts for model training. In this work, we identify two flaws with the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries and prove it is equivalent to estimating the oracle expectation for each document sentence. Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings.
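To make the contrast between the two labeling schemes concrete, the sketch below illustrates the general idea under simplifying assumptions: it is not the paper's algorithm, ROUGE is replaced by a crude unigram-F1 proxy, and the function names, extract size, and `top_k` cutoff are hypothetical choices for illustration only. The greedy oracle commits to a single 0/1 labeling, whereas the soft labels spread probability mass over sentences according to how often (and how well) they appear in high-scoring candidate extracts.

```python
# Minimal sketch (not the paper's implementation) contrasting a greedy oracle
# with soft, expectation-style sentence labels. ROUGE is approximated here by
# unigram F1; all names and hyperparameters are illustrative assumptions.
from itertools import combinations


def unigram_f1(candidate: str, reference: str) -> float:
    """Crude stand-in for ROUGE: unigram-overlap F1 between two texts."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = len(set(cand) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sents: list[str], reference: str, max_len: int = 3) -> list[int]:
    """Standard greedy labeling: repeatedly add the sentence that most improves
    the score of the running extract; stop when no sentence helps."""
    selected: list[int] = []
    best = 0.0
    while len(selected) < max_len:
        gains = [
            (unigram_f1(" ".join(sents[j] for j in selected + [i]), reference), i)
            for i in range(len(sents)) if i not in selected
        ]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # deterministic: a single extract, hard 0/1 labels
            break
        best, selected = score, selected + [i]
    return sorted(selected)


def soft_labels(sents: list[str], reference: str, size: int = 2, top_k: int = 5) -> list[float]:
    """Expectation-style labels: average sentence membership over the top-k
    candidate extracts, weighted by their normalised scores."""
    scored = sorted(
        ((unigram_f1(" ".join(sents[i] for i in combo), reference), combo)
         for combo in combinations(range(len(sents)), size)),
        reverse=True,
    )[:top_k]
    total = sum(s for s, _ in scored) or 1.0
    labels = [0.0] * len(sents)
    for score, combo in scored:
        for i in combo:
            labels[i] += score / total  # probability that sentence i is summary-worthy
    return labels


if __name__ == "__main__":
    doc = ["The cat sat on the mat.",
           "A storm hit the coast overnight.",
           "Officials said the storm caused flooding.",
           "The mat was red."]
    ref = "An overnight storm caused coastal flooding, officials said."
    print("greedy oracle:", greedy_oracle(doc, ref))
    print("soft labels:  ", [round(x, 2) for x in soft_labels(doc, ref)])
```

In this toy setup the soft labels are what a model would be trained to regress toward, so sentences that recur across several near-optimal extracts receive partial credit rather than being discarded by a single deterministic oracle.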