A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Given the features of a video, a recurrent neural network can be used to automatically generate a caption for it. Existing video captioning methods have at least three limitations. First, semantic information has been widely used to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often used to optimize video captioning models, but it guides word generation with different strategies during training and inference, which leads to poor performance. Third, current video captioning models tend to generate relatively short captions, which describe the video content inadequately. To address these three problems, we make three corresponding improvements. First, we use both static spatial features and dynamic spatio-temporal features as input to a semantic detection network (SDN) in order to generate meaningful semantic features for videos. Second, we propose a scheduled sampling strategy that gradually shifts training from a teacher-guided manner towards a more self-teaching manner. Finally, the ordinary log-probability loss function is adjusted by sentence length so that the bias towards short sentences is alleviated. Our model achieves state-of-the-art results on the Youtube2Text dataset and is competitive with state-of-the-art models on the MSR-VTT dataset.
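As a rough illustration of the second and third improvements, the sketch below shows one way scheduled sampling and a length-adjusted loss could be written. It is not the implementation described here: the linear decay schedule, the function names (`teacher_forcing_prob`, `choose_next_input`, `length_adjusted_nll`), and the exponent `beta` are all assumptions made for the example.

```python
import random


def teacher_forcing_prob(epoch, num_epochs, floor=0.0):
    """Probability of feeding the ground-truth word at a given epoch.

    Assumption: a linear decay from 1.0 (pure teacher forcing) down to
    `floor`. Other schedules are equally possible.
    """
    frac = epoch / max(num_epochs - 1, 1)
    frac = min(max(frac, 0.0), 1.0)
    return max(1.0 - frac, floor)


def choose_next_input(gold_token, predicted_token, p_teacher, rng=random):
    """Scheduled sampling step: pick the decoder's next input token.

    With probability p_teacher the ground-truth token is fed (teacher
    guiding); otherwise the model's own previous prediction is fed
    (self teaching).
    """
    return gold_token if rng.random() < p_teacher else predicted_token


def length_adjusted_nll(token_log_probs, beta=1.0):
    """Negative log-likelihood of one caption divided by length**beta.

    Assumption: dividing by a power of the caption length is one
    hypothetical way to stop longer captions from being penalized
    simply for containing more words.
    """
    length = len(token_log_probs)
    if length == 0:
        raise ValueError("caption must contain at least one token")
    return -sum(token_log_probs) / (length ** beta)


if __name__ == "__main__":
    for epoch in (0, 5, 9):
        p = teacher_forcing_prob(epoch, num_epochs=10)
        print(epoch, round(p, 2), choose_next_input("dog", "cat", p))
    print(length_adjusted_nll([-0.5, -0.5, -1.0], beta=1.0))
```

Early in training `teacher_forcing_prob` stays near 1, so the decoder mostly sees ground-truth words. As it decays, the decoder is increasingly conditioned on its own outputs, which is closer to what happens at inference.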
