The goal of multimodal abstractive summarization (MAS) is to produce a concise summary from multimodal data (text and vision). Existing studies on MAS mainly focus on how to effectively use the extracted visual features, and have achieved impressive success on high-resource English datasets. However, less attention has been paid to how relevant the visual features are to the summary, which may limit model performance, especially in low- and zero-resource scenarios. In this paper, we propose to improve summary quality through summary-oriented visual features. To this end, we devise two auxiliary tasks: a \emph{vision-to-summary task} and a \emph{masked image modeling task}. Together with the main summarization task, we optimize the MAS model via the training objectives of all these tasks. In this way, the MAS model is enhanced to capture summary-oriented visual features and thus yields more accurate summaries. Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach, which achieves state-of-the-art performance under all scenarios.
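The joint optimization of the main task and the two auxiliary tasks can be sketched as a weighted sum of per-task losses. The symbols below (the individual losses and the weights $\lambda_{1}$, $\lambda_{2}$) are illustrative assumptions, not notation given in the abstract:

```latex
\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\mathrm{sum}}}_{\text{summarization}}
\;+\; \lambda_{1}\,\underbrace{\mathcal{L}_{\mathrm{v2s}}}_{\text{vision to summary}}
\;+\; \lambda_{2}\,\underbrace{\mathcal{L}_{\mathrm{mim}}}_{\text{masked image modeling}}
```

Here $\mathcal{L}_{\mathrm{sum}}$ denotes the training objective of the main summarization task, while $\mathcal{L}_{\mathrm{v2s}}$ and $\mathcal{L}_{\mathrm{mim}}$ correspond to the two auxiliary tasks that steer the visual features toward summary-relevant content.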