Video captioning refers to automatically generating a descriptive sentence for a short video clip, and it has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefit of visual-audio resonance information. The first explores the impact of cross-modal feature fusion from low order to high order. The second establishes short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends this temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Extensive experiments validate the effectiveness of our three cross-modal fusion strategies on two benchmark datasets, Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we propose a dynamic multimodal feature fusion framework to handle the case where some modalities are missing. Experimental results demonstrate that even when audio is absent, we can still obtain comparable results with the aid of an additional audio modality inference module.
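To make the weight-sharing strategy concrete, the following is a minimal sketch (not the authors' released code) of how visual and audio streams can pass through a shared front-end recurrent encoder so that the two modalities are embedded in a coordinated space before fusion. The module name, feature dimensions (2048-d visual, 128-d audio, 512-d hidden), and the concatenation-based fusion are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SharedFrontEnd(nn.Module):
    """Illustrative weight-shared front-end for visual and audio features."""
    def __init__(self, visual_dim=2048, audio_dim=128, hidden_dim=512):
        super().__init__()
        # Modality-specific projections map both streams to a common size.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # A single GRU is shared across modalities (weight sharing),
        # encouraging short-term visual-audio dependency.
        self.shared_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, T, visual_dim); audio_feats: (B, T, audio_dim)
        v, _ = self.shared_rnn(self.visual_proj(visual_feats))
        a, _ = self.shared_rnn(self.audio_proj(audio_feats))
        # Simple fusion by concatenation; the paper studies richer
        # low- to high-order fusion variants.
        return torch.cat([v, a], dim=-1)

if __name__ == "__main__":
    model = SharedFrontEnd()
    fused = model(torch.randn(2, 20, 2048), torch.randn(2, 20, 128))
    print(fused.shape)  # torch.Size([2, 20, 1024])
```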