Efficient spatiotemporal modeling is an important yet challenging problem in video action recognition. Existing state-of-the-art methods exploit motion cues to assist short-term temporal modeling by taking temporal differences over consecutive frames. However, camera movement inevitably introduces background noise into these differences. Moreover, the motion involved in different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly consists of a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module. Specifically, SME highlights motion-sensitive areas through local-global motion modeling, in which background suppression and pyramidal feature differencing are applied successively between neighboring frames to capture motion dynamics with less background noise. CTI performs multi-scale temporal modeling through a group of separate 1D convolutions and integrates the temporal interactions across these scales with an attention mechanism. With these two modules, both long-term and short-term temporal relationships can be encoded efficiently while introducing only a limited number of additional parameters. Extensive experiments on several popular benchmarks (Something-Something v1 & v2, Kinetics-400, UCF-101, and HMDB-51) demonstrate the effectiveness and superiority of the proposed method.
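The two ideas named in the abstract, temporal differences between neighboring frames and attention-weighted fusion of several temporal 1D convolutions, can be illustrated with a minimal NumPy sketch. This is not the TSI implementation: the function names, the fixed averaging kernels, the kernel sizes (1, 3, 5), and the use of a time-averaged response as the attention logit are all assumptions made for illustration, where the actual method would use learned layers and operate on spatial feature maps.

```python
import numpy as np


def temporal_difference(feats):
    """Difference between neighboring frames.

    feats: array of shape (T, C), one feature vector per frame.
    Returns an array of shape (T, C); the last step is zero-padded.
    """
    diff = feats[1:] - feats[:-1]
    return np.concatenate([diff, np.zeros_like(feats[:1])], axis=0)


def temporal_conv1d(feats, kernel):
    """Sliding-window 1D filter along time with 'same' zero padding.

    The same kernel is applied to every channel (hypothetical stand-in
    for a learned depthwise temporal convolution).
    """
    k = len(kernel)
    pad = k // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)))
    t = feats.shape[0]
    return sum(kernel[i] * padded[i:i + t] for i in range(k))


def multi_scale_fusion(feats, kernel_sizes=(1, 3, 5)):
    """Run one temporal branch per kernel size and fuse them with softmax weights.

    Assumption: attention logits are the time-averaged branch responses.
    Returns the fused features (T, C) and the weights (S, 1, C).
    """
    branches = np.stack(
        [temporal_conv1d(feats, np.full(k, 1.0 / k)) for k in kernel_sizes]
    )  # (S, T, C)
    logits = branches.mean(axis=1, keepdims=True)  # (S, 1, C)
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over scales
    fused = (weights * branches).sum(axis=0)  # (T, C)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 4))       # 8 frames, 4 channels (toy data)
    motion = temporal_difference(frames)   # short-term motion cue
    fused, weights = multi_scale_fusion(motion)
    print(fused.shape, weights.shape)
```

In this sketch the small kernel responds to short-term changes and the larger kernels aggregate over longer windows, while the softmax weights decide per channel how much each temporal scale contributes.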