Typical techniques for video captioning follow the encoder-decoder framework, which can only focus on the single source video being processed. A potential disadvantage of this design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data. To tackle this limitation, we propose the Memory-Attended Recurrent Network (MARN) for video captioning, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in the training data. Thus, our model achieves a more comprehensive understanding of each word and yields higher captioning quality. Furthermore, the built memory structure enables our method to model the compatibility between adjacent words explicitly, rather than requiring the model to learn it implicitly as most existing models do. Extensive validation on two real-world datasets demonstrates that our MARN consistently outperforms state-of-the-art methods.
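To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a word-level memory: each vocabulary word is associated with visual context vectors collected from the training videos whose captions contain that word, and the decoder attends over these vectors when scoring that word. Function names such as `build_word_memory` and `memory_attend`, and the use of mean-pooled video features, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def build_word_memory(captions, video_feats, vocab_size, feat_dim, max_slots=8):
    """Collect up to `max_slots` visual context vectors per word.

    captions    : list of word-index lists, one caption per training video
    video_feats : (num_videos, feat_dim) tensor, e.g. mean-pooled CNN features
    Returns a (vocab_size, max_slots, feat_dim) memory and a validity mask.
    """
    memory = torch.zeros(vocab_size, max_slots, feat_dim)
    counts = torch.zeros(vocab_size, dtype=torch.long)
    for caption, feat in zip(captions, video_feats):
        for w in set(caption):
            if counts[w] < max_slots:
                memory[w, counts[w]] = feat   # store this video's context for word w
                counts[w] += 1
    mask = torch.arange(max_slots).unsqueeze(0) < counts.unsqueeze(1)
    return memory, mask


def memory_attend(hidden, word_ids, memory, mask):
    """Attend over the memory slots of each candidate word.

    hidden   : decoder hidden state, shape (batch, feat_dim)
    word_ids : candidate word indices, shape (batch,)
    Returns one memory context vector per example, shape (batch, feat_dim).
    """
    slots = memory[word_ids]                            # (batch, max_slots, feat_dim)
    slot_mask = mask[word_ids]                          # (batch, max_slots)
    scores = torch.bmm(slots, hidden.unsqueeze(2)).squeeze(2)
    scores = scores.masked_fill(~slot_mask, float("-inf"))
    weights = torch.nan_to_num(F.softmax(scores, dim=1))  # words with no slots -> zero context
    return torch.bmm(weights.unsqueeze(1), slots).squeeze(1)


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(3, 16)                # 3 toy training videos
    caps = [[1, 2, 3], [2, 4], [1, 4, 5]]     # their caption word indices
    mem, mask = build_word_memory(caps, feats, vocab_size=6, feat_dim=16)
    ctx = memory_attend(torch.randn(2, 16), torch.tensor([2, 4]), mem, mask)
    print(ctx.shape)                          # torch.Size([2, 16])
```

Because each word's memory aggregates contexts from every training video it appears in, the attended vector summarizes that word's full range of visual usage rather than only the single video currently being decoded, which is the intuition the abstract describes.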