Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability, and 2) what is the optimal location in GPLMs to inject the visual information. In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task, using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our visual guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.
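To make the idea of an attention-based add-on layer concrete, here is a minimal sketch (not the authors' released code) of how text hidden states from a GPLM layer can attend over video features and fuse the attended visual context back into the text representation. The module name `VisionGuidedFusion`, the gated residual merge, and the dimensions (`d_text`, `d_vision`) are illustrative assumptions, not the exact architecture reported in the paper.

```python
# Minimal sketch of a text-to-vision cross-attention add-on layer.
# All names and dimensions here are assumptions for illustration.
import torch
import torch.nn as nn


class VisionGuidedFusion(nn.Module):
    """Text-to-vision cross-attention followed by a gated residual merge."""

    def __init__(self, d_text: int = 768, d_vision: int = 2048, n_heads: int = 8):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_text)  # map video features into text space
        self.cross_attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_text, d_text)       # gated fusion of text and attended vision
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_states, vision_feats):
        # text_states: (batch, text_len, d_text), hidden states from a GPLM layer
        # vision_feats: (batch, video_len, d_vision), e.g. pooled video clip features
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_states, key=v, value=v)
        g = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        # residual connection keeps the GPLM's original text pathway intact
        return self.norm(text_states + g * attended)


if __name__ == "__main__":
    fusion = VisionGuidedFusion()
    text = torch.randn(2, 50, 768)    # e.g. encoder states for a transcript
    video = torch.randn(2, 30, 2048)  # e.g. video features for the same clip
    print(fusion(text, video).shape)  # torch.Size([2, 50, 768])
```

Because the visual context enters through a residual, gated path, the add-on layer can be placed at different depths of the GPLM without overwriting its pretrained text representations, which is the design question the ablations on fusion methods and fusion locations address.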