In multi-document summarization (MDS), the input is a cluster of documents, and the output is a summary of that cluster. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a simple pretraining objective: selecting the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus requires no human-written summaries and can be used for pretraining on datasets containing only clusters of documents. Through zero-shot and fully supervised experiments on multiple MDS datasets, we show that our model, Centrum, performs better than or comparably to a state-of-the-art model. We release our pretrained and finetuned models at https://github.com/ratishsp/centrum.
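To make the objective concrete, the sketch below shows one way a ROUGE-based centroid could be selected: score each document against the rest of its cluster and keep the one with the highest average ROUGE F1. This is a minimal illustration, not the authors' released code; the `rouge-score` package, the `select_centroid` helper, and the particular ROUGE variants and averaging are assumptions, and the paper defines the precise scoring.

```python
# Minimal sketch of ROUGE-based centroid selection for a document cluster.
# Assumes the `rouge-score` package (pip install rouge-score); the choice of
# ROUGE variants and equal-weight averaging are illustrative assumptions.
from rouge_score import rouge_scorer


def select_centroid(cluster):
    """Return (index, document) of the ROUGE-based centroid of `cluster`,
    a list of document strings."""
    if len(cluster) < 2:
        return 0, cluster[0]
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    best_idx, best_score = 0, float("-inf")
    for i, candidate in enumerate(cluster):
        others = [doc for j, doc in enumerate(cluster) if j != i]
        # Average F1 over the ROUGE variants and over the other documents.
        total = 0.0
        for reference in others:
            scores = scorer.score(reference, candidate)
            total += sum(s.fmeasure for s in scores.values()) / len(scores)
        avg = total / len(others)
        if avg > best_score:
            best_idx, best_score = i, avg
    return best_idx, cluster[best_idx]
```

The selected centroid would then serve as the pretraining target, with the model trained to generate it from the remaining documents in the cluster; no human-written summary is needed at any point.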