In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.