With the rapid increase of multimedia data, a large body of literature has emerged on multimodal summarization, most of which aims to distill salient information from textual and visual modalities and output a pictorial summary with the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on qualified image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounded on BART, UniMS, which integrates extractive and abstractive objectives as well as image output selection. Specifically, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which removes any requirement on the existence or quality of image captions. In addition, we introduce a visual guided decoder to better integrate the textual and visual modalities when guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly introduced extractive objective and the knowledge distillation technique are both shown to bring a noticeable improvement to the multimodal summarization task.
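To illustrate the knowledge distillation idea for image selection (the abstract does not give the exact formulation), below is a minimal sketch assuming the teacher is a vision-language pretrained model that scores image relevance against the source document, and the student is the summarizer's image-selection head. All names (`distill_image_selection_loss`, `student_logits`, `teacher_scores`) are hypothetical and illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def distill_image_selection_loss(student_logits: torch.Tensor,
                                 teacher_scores: torch.Tensor,
                                 temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical distillation loss for image selection (a sketch, not the paper's method).

    student_logits: (batch, num_images) relevance logits from the summarizer's
                    image-selection head.
    teacher_scores: (batch, num_images) image-document matching scores from a
                    vision-language pretrained teacher (no captions needed).
    """
    # Soften both distributions with a temperature, as in standard knowledge
    # distillation, then align the student with the teacher via KL divergence.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Because the teacher scores images directly against the source text, such a loss needs no reference captions, which is consistent with the abstract's claim that caption existence and quality are no longer required.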