Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures are for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. In particular, StNet stacks N successive video frames into a \emph{super-image} with 3N channels and applies 2D convolution on the super-images to capture local spatial-temporal relationships. To model global spatial-temporal relationships, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet; it employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization ability of the learned video representations on the UCF101 dataset.
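The following is a minimal PyTorch sketch of the two ideas described above: stacking N frames into a 3N-channel super-image processed by 2D convolution, and a temporal Xception-style block that applies a channel-wise (depthwise) temporal convolution followed by a point-wise convolution. The module names (SuperImageStem, TemporalXceptionBlock) and all hyper-parameters are illustrative assumptions, not the authors' released implementation.

\begin{verbatim}
# Hypothetical sketch of the super-image and temporal Xception ideas;
# names and hyper-parameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

class SuperImageStem(nn.Module):
    """Stack N consecutive RGB frames into a 3N-channel super-image and
    apply 2D convolution to capture local spatial-temporal structure."""
    def __init__(self, n_frames=5, out_channels=64):
        super().__init__()
        self.n_frames = n_frames
        self.conv = nn.Conv2d(3 * n_frames, out_channels, kernel_size=7,
                              stride=2, padding=3, bias=False)

    def forward(self, x):
        # x: (batch, T, 3, H, W) with T a multiple of n_frames
        b, t, c, h, w = x.shape
        x = x.view(b, t // self.n_frames, self.n_frames * c, h, w)
        x = x.flatten(0, 1)          # merge batch and super-image dims
        return self.conv(x)          # (b * T/N, out_channels, H/2, W/2)

class TemporalXceptionBlock(nn.Module):
    """Separable 1D convolution over the temporal feature sequence:
    a channel-wise (depthwise) temporal conv, then a point-wise conv."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, T) -- one feature vector per super-image
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    frames = torch.randn(2, 15, 3, 224, 224)        # 2 clips, 15 frames each
    local_feat = SuperImageStem(n_frames=5)(frames) # (6, 64, 112, 112)
    # pool spatial dims and regroup into a per-clip temporal sequence
    seq = local_feat.mean(dim=(2, 3)).view(2, 3, -1).transpose(1, 2)  # (2, 64, 3)
    global_feat = TemporalXceptionBlock(64)(seq)
    print(global_feat.shape)                        # torch.Size([2, 64, 3])
\end{verbatim}

In this sketch, the 2D stem models short-range (local) spatial-temporal structure within each super-image, while the separable temporal block mixes information across super-images to capture longer-range (global) dynamics, mirroring the local/global decomposition described in the abstract.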