Spatiotemporal modeling and its computational complexity are the two most intensively studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy with little regard for complexity, whereas efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve efficiency and effectiveness simultaneously. First, besides the traditional treatment of an H x W x T video as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model the video from the other two planes, Height-Time and Width-Time, to capture video dynamics thoroughly. Second, our model is built on 2D CNN backbones, and model complexity is kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolutions for efficiency. It is a plug-and-play module that can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that specializes to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) demonstrate its superiority. The proposed MVFNet achieves state-of-the-art performance at the complexity of a 2D CNN.
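To make the multi-view idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a feature tensor of shape (C, T, H, W), applies channel-wise (depthwise) 1D filters along the T, H, and W axes to a fraction of the channels, and sums the three responses. The function names, the `ratio` parameter, and the choice of summation as the fusion rule are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def depthwise_conv1d(x, kernels, axis):
    """Channel-wise 1D cross-correlation along `axis` with zero padding.

    x:       array of shape (C, T, H, W)
    kernels: array of shape (C, k), one odd-length filter per channel
    """
    c, k = kernels.shape
    pad = k // 2
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)
    n = x.shape[axis]
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        window = [slice(None)] * x.ndim
        window[axis] = slice(i, i + n)
        # Broadcast the i-th tap of every channel's filter over (T, H, W).
        out += kernels[:, i].reshape(c, 1, 1, 1) * xp[tuple(window)]
    return out

def multi_view_fusion(x, k_t, k_h, k_w, ratio=0.5):
    """Hypothetical multi-view fusion step (illustrative only).

    The first `ratio` fraction of channels is filtered along T, H, and W
    and the three responses are summed; the remaining channels pass through.
    """
    m = int(x.shape[0] * ratio)
    part = x[:m]
    fused = (depthwise_conv1d(part, k_t, axis=1)    # filtering along T
             + depthwise_conv1d(part, k_h, axis=2)  # filtering along H
             + depthwise_conv1d(part, k_w, axis=3)) # filtering along W
    return np.concatenate([fused, x[m:]], axis=0)

# Example: 4 channels, 3 frames, 2 x 2 spatial resolution.
x = np.arange(4 * 3 * 2 * 2, dtype=float).reshape(4, 3, 2, 2)
zero = np.zeros((2, 3))
identity = np.tile([0.0, 1.0, 0.0], (2, 1))
shift = np.tile([1.0, 0.0, 0.0], (2, 1))

y_identity = multi_view_fusion(x, identity, zero, zero)  # no cross-frame mixing
y_shift = multi_view_fusion(x, shift, zero, zero)        # one-frame temporal shift
```

In this sketch, an identity temporal filter with zero H and W filters leaves the features unchanged, so each frame is processed independently as in a purely 2D model. A fixed one-hot temporal filter such as `[1, 0, 0]` delays the selected channels by one frame, which is how a temporal shift can be written as a depthwise temporal convolution. This is meant only to illustrate how different filter settings can recover simpler behaviors, under the assumptions stated above.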