Producing a reduced version of a source text, as in generic or focused summarization, inherently involves two distinct subtasks: deciding on targeted content and generating a coherent text conveying it. While some popular approaches address summarization as a single end-to-end task, prominent works support decomposed modeling for the individual subtasks. Further, semi-automated text reduction is also very appealing, where users may identify targeted content while models generate a corresponding coherent summary. In this paper, we focus on the second subtask, of generating coherent text given pre-selected content. Concretely, we formalize \textit{Controlled Text Reduction} as a standalone task, whose input is a source text with marked spans of targeted content ("highlighting"). A model then needs to generate a coherent text that includes all and only the target information. We advocate the potential of such models, both for modular fully-automatic summarization and for semi-automated human-in-the-loop use cases. To facilitate proper research, we crowdsource high-quality dev and test datasets for the task. Further, we automatically generate a larger "silver" training dataset from available summarization benchmarks, leveraging a pretrained summary-source alignment model. Finally, employing these datasets, we present a supervised baseline model, showing promising results and insightful analyses.