In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model. Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs and automatically determining whether those questions are answered in the gold summaries. This QA-based signal is incorporated into a two-stage summarization model that first marks salient NPs in the input document using a classification model, then generates a summary conditioned on the marked document. On benchmark summarization datasets, our experiments demonstrate that models trained with QA-based supervision generate higher-quality summaries than those trained with baseline methods of identifying salient spans. Further, we show that the content of the generated summaries can be controlled by choosing which NPs are marked in the input document. Finally, we propose a method of augmenting the training data so that the gold summaries are more consistent with the marked input spans used during training, and we show that this results in models that learn to better exclude unmarked document content.
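The two steps described above (labeling NPs by whether their questions are answered in the gold summary, then marking the salient NPs in the input) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The marker tokens `<np>` and `</np>`, the function names, and the `generate_question` and `is_answered` callables are all hypothetical stand-ins for the question-generation and QA components, which the abstract does not specify. NP spans are assumed to be non-overlapping token ranges.

```python
from typing import Callable, List, Tuple

# Hypothetical marker tokens; the abstract does not say how spans are marked.
OPEN, CLOSE = "<np>", "</np>"


def label_nps(
    nps: List[str],
    gold_summary: str,
    generate_question: Callable[[str], str],
    is_answered: Callable[[str, str], bool],
) -> List[bool]:
    """Label an NP as salient if the question generated for it is answered
    in the gold summary. Both callables are assumed, pluggable components."""
    return [is_answered(generate_question(np), gold_summary) for np in nps]


def mark_salient_nps(
    tokens: List[str],
    np_spans: List[Tuple[int, int]],
    salient: List[bool],
) -> str:
    """Wrap each salient NP span (start inclusive, end exclusive) in markers.
    Assumes the spans do not overlap."""
    starts = {s for (s, _), keep in zip(np_spans, salient) if keep}
    ends = {e for (_, e), keep in zip(np_spans, salient) if keep}
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append(OPEN)
        out.append(tok)
        if i + 1 in ends:
            out.append(CLOSE)
    return " ".join(out)


# Toy usage with stub components (illustrative only).
tokens = "The mayor opened a new bridge on Monday".split()
np_spans = [(0, 2), (3, 6), (7, 8)]
nps = [" ".join(tokens[s:e]) for s, e in np_spans]
summary = "The mayor opened a new bridge ."

salient = label_nps(
    nps,
    summary,
    generate_question=lambda np: np,          # stub: the "question" is the NP itself
    is_answered=lambda q, summ: q in summ,    # stub: substring match as a fake QA model
)
marked = mark_salient_nps(tokens, np_spans, salient)
print(marked)
```

In a real system, the marked string would be the input to the conditional generation stage, and changing which spans are marked is what would allow the summary content to be controlled.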