Inspired by the fact that different modalities in videos carry complementary information, we propose the Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating attribute detection as a multi-label classification problem. We further add an auxiliary classification loss to the model so that it learns more effective visual features and high-level multimodal semantic attribute distributions for richer video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ an attention mechanism to attend to different attributes at each step of the captioning process. We evaluate our approach on two popular public benchmarks, MSVD and MSR-VTT, achieving results competitive with the current state of the art across six evaluation metrics.
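The abstract names two mechanisms: a multi-label attribute detector with an auxiliary classification loss on the encoder side, and a decoder LSTM whose weight matrices are an attention-weighted ensemble of attribute-dependent matrices. The following is a minimal PyTorch sketch of both ideas; the class names, dimensions, and the exact attention parameterization are illustrative assumptions, not the paper's precise formulation.

```python
# Illustrative sketch only: all module names, shapes, and the attention
# parameterization are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDetector(nn.Module):
    """Encoder-side multi-label attribute classifier with auxiliary BCE loss."""
    def __init__(self, feat_size, num_attributes):
        super().__init__()
        self.fc = nn.Linear(feat_size, num_attributes)

    def forward(self, video_feat, attr_labels=None):
        # video_feat: (B, feat_size) pooled multimodal video feature
        logits = self.fc(video_feat)
        probs = torch.sigmoid(logits)              # per-attribute probabilities
        aux_loss = (F.binary_cross_entropy_with_logits(logits, attr_labels)
                    if attr_labels is not None else None)
        return probs, aux_loss

class AttributeLSTMCell(nn.Module):
    """Decoder cell: each LSTM weight matrix becomes an ensemble of K
    attribute-dependent matrices, mixed per step by attention over attributes."""
    def __init__(self, input_size, hidden_size, num_attributes):
        super().__init__()
        self.hidden_size = hidden_size
        K = num_attributes
        # One (4H x D) input matrix and one (4H x H) recurrent matrix per
        # attribute; the four row blocks cover the i, f, g, o gates.
        self.W = nn.Parameter(torch.randn(K, 4 * hidden_size, input_size) * 0.01)
        self.U = nn.Parameter(torch.randn(K, 4 * hidden_size, hidden_size) * 0.01)
        self.b = nn.Parameter(torch.zeros(4 * hidden_size))
        # Scores each attribute against the current hidden state (assumed form).
        self.attn = nn.Linear(hidden_size + K, K)

    def forward(self, x, state, attr_probs):
        # x: (B, D) word embedding; state: (h, c) each (B, H);
        # attr_probs: (B, K) attribute distribution from the encoder.
        h, c = state
        alpha = F.softmax(self.attn(torch.cat([h, attr_probs], dim=-1)), dim=-1)
        # Mix the ensemble into per-example weight matrices for this step.
        W_t = torch.einsum('bk,kij->bij', alpha, self.W)   # (B, 4H, D)
        U_t = torch.einsum('bk,kij->bij', alpha, self.U)   # (B, 4H, H)
        gates = (torch.bmm(W_t, x.unsqueeze(-1)).squeeze(-1)
                 + torch.bmm(U_t, h.unsqueeze(-1)).squeeze(-1) + self.b)
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

In this reading, the auxiliary BCE loss from `AttributeDetector` would be added to the captioning loss during training, and `attr_probs` conditions the decoder so that different attributes dominate the effective LSTM weights at different time steps.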