Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity to retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.