In this paper, the problem of describing the visual content of a video sequence with natural language is addressed. Unlike previous video captioning work, which mainly exploits the cues of video content to generate a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features from the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost encoder-decoder models, leading to significant gains in video captioning accuracy.
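To make the dual-flow objective concrete, the following is a minimal PyTorch sketch of the joint training loss, assuming a simplified single-layer LSTM decoder conditioned on a mean-pooled video feature and a global-style reconstructor; the dimensions, layer choices, and the loss weight `lambda_rec` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecNetSketch(nn.Module):
    """Encoder-decoder-reconstructor sketch: the decoder realizes the
    forward (video -> sentence) flow, and the reconstructor realizes
    the backward (sentence -> video) flow."""

    def __init__(self, feat_dim=512, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        # Decoder consumes each word embedding concatenated with a
        # mean-pooled video feature (a simplification of an
        # attention-based decoder).
        self.decoder = nn.LSTM(feat_dim + hid_dim, hid_dim, batch_first=True)
        self.vocab_out = nn.Linear(hid_dim, vocab_size)
        # Reconstructor runs over the decoder's hidden-state sequence
        # and tries to reproduce the video feature.
        self.reconstructor = nn.LSTM(hid_dim, feat_dim, batch_first=True)

    def forward(self, video_feats, captions):
        # video_feats: (B, T_v, feat_dim); captions: (B, T_c) token ids.
        ctx = video_feats.mean(dim=1)                       # (B, feat_dim)
        emb = self.embed(captions[:, :-1])                  # teacher forcing
        steps = emb.size(1)
        dec_in = torch.cat(
            [emb, ctx.unsqueeze(1).expand(-1, steps, -1)], dim=-1)
        hidden, _ = self.decoder(dec_in)                    # (B, steps, hid_dim)
        logits = self.vocab_out(hidden)

        # Forward-flow (generation) loss: next-token cross-entropy
        # (padding-token masking omitted for brevity).
        gen_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            captions[:, 1:].reshape(-1))

        # Backward-flow (reconstruction) loss: distance between the
        # mean-pooled video feature and the feature reconstructed from
        # the decoder's hidden states.
        rec, _ = self.reconstructor(hidden)                 # (B, steps, feat_dim)
        rec_loss = F.mse_loss(rec.mean(dim=1), ctx)
        return gen_loss, rec_loss
```

A hypothetical training step then combines the two losses, mirroring the end-to-end joint objective described above:

```python
model = RecNetSketch()
video = torch.randn(4, 20, 512)             # 4 clips, 20 frame features each
caps = torch.randint(0, 10000, (4, 12))     # padded caption token ids
gen_loss, rec_loss = model(video, caps)
loss = gen_loss + 0.2 * rec_loss            # lambda_rec = 0.2 is an assumed weight
loss.backward()
```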