Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved through two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in the video. Temporal attention-based models have made considerable progress by weighting the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, namely the temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate a list of words describing the video. Experimental results on two public video captioning benchmarks indicate that our TS-LSTM outperforms state-of-the-art methods.
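To make the described pipeline concrete, the following is a minimal sketch of a TS-LSTM-style encoder-decoder: frame-level CNN features are encoded per sub-shot by a TP-LSTM, the sub-shot encodings are pooled into a video representation, and a stacked LSTM decoder generates the caption. The layer sizes, the mean pooling over sub-shots, and all class/parameter names (e.g., `TSLSTMSketch`, `feat_dim`, `hid_dim`) are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a TS-LSTM-like captioning model (PyTorch).
import torch
import torch.nn as nn

class TSLSTMSketch(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000, embed_dim=512):
        super().__init__()
        # TP-LSTM: encodes frame-level CNN features within each sub-shot
        # to capture long-term temporal dynamics (sizes assumed).
        self.tp_lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        # Stacked LSTM decoder: generates the caption word by word,
        # conditioned on the pooled video representation.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + hid_dim, hid_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, sub_shots, captions):
        # sub_shots: (batch, n_shots, n_frames, feat_dim) pre-extracted CNN features
        # captions:  (batch, seq_len) token ids, used with teacher forcing
        b, s, f, d = sub_shots.shape
        # Encode every sub-shot independently with the TP-LSTM.
        _, (h, _) = self.tp_lstm(sub_shots.view(b * s, f, d))
        shot_feats = h[-1].view(b, s, -1)      # one vector per sub-shot
        video_feat = shot_feats.mean(dim=1)    # pool sub-shots (mean pooling assumed)
        # Decode: concatenate the video context to each word embedding.
        emb = self.embed(captions)
        ctx = video_feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)               # logits over the vocabulary
```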