Progress in Query-focused Multi-Document Summarization (QMDS) has been limited by the lack of sufficiently large-scale, high-quality training datasets. We present two QMDS training datasets, which we construct using two data augmentation methods: (1) transferring the commonly used single-document CNN/Daily Mail summarization dataset to create the QMDSCNN dataset, and (2) mining search-query logs to create the QMDSIR dataset. These two datasets have complementary properties, i.e., QMDSCNN has real summaries but simulated queries, while QMDSIR has real queries but simulated summaries. To cover both the real-summary and real-query aspects, we build abstractive end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets. We also introduce new hierarchical encoders that enable more efficient encoding of the query together with multiple documents. Empirical results demonstrate that our data augmentation and encoding methods outperform baseline models on automatic metrics, as well as in human evaluations along multiple attributes.