Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit neighboring feature differences to obtain motion cues for short-term temporal modeling with a simple convolution. However, a single local convolution cannot handle diverse kinds of actions because of its limited receptive field. Moreover, action-irrelevant noise introduced by camera movement also harms the quality of the extracted motion features. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-perception Temporal Integration (CTI) module. Specifically, SME aims to highlight motion-sensitive areas through spatial-level local-global motion modeling, where saliency alignment and pyramidal motion modeling are conducted successively between adjacent frames to capture motion dynamics with less noise caused by misaligned backgrounds. CTI is designed to perform multi-perception temporal modeling through a group of separate 1D convolutions. Meanwhile, temporal interactions across different perceptions are integrated with an attention mechanism. Through these two modules, long- and short-term temporal relationships can be encoded efficiently while introducing only a limited number of additional parameters. Extensive experiments on several popular benchmarks (i.e., Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51) demonstrate the effectiveness of our proposed method.
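The two ideas above — gating features with adjacent-frame differences (SME) and fusing several temporal receptive fields with attention (CTI) — can be sketched in PyTorch. This is a minimal illustration under assumptions about the design, not the paper's exact architecture: the class names, the depthwise convolutions, the kernel-size set, and the pooled-feature attention are all hypothetical simplifications.

```python
import torch
import torch.nn as nn


class SalientMotionExcitation(nn.Module):
    """Sketch of an SME-style module (assumed design): adjacent-frame
    feature differences serve as a motion cue that gates the features."""

    def __init__(self, channels):
        super().__init__()
        # depthwise 3x3 conv to smooth the raw difference map (assumption)
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        diff = x[:, 1:] - x[:, :-1]                # adjacent-frame differences
        diff = torch.cat([diff, diff[:, -1:]], 1)  # repeat last step to keep T
        b, t, c, h, w = diff.shape
        gate = torch.sigmoid(self.smooth(diff.reshape(b * t, c, h, w)))
        return x * gate.reshape(b, t, c, h, w)     # motion-sensitive gating


class CrossPerceptionTemporalIntegration(nn.Module):
    """Sketch of a CTI-style module (assumed design): separate 1D temporal
    convolutions with different kernel sizes, fused by attention weights."""

    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # one depthwise 1D temporal conv per "perception" (kernel size)
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # attention logits over branches from globally pooled features
        self.attn = nn.Linear(channels, len(kernel_sizes))

    def forward(self, x):
        # x: (batch, channels, time)
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, P, C, T)
        w = torch.softmax(self.attn(x.mean(dim=2)), dim=1)        # (B, P)
        return (w[:, :, None, None] * outs).sum(dim=1)            # (B, C, T)
```

Both modules preserve the input shape, so they can be dropped into a 2D backbone between stages; the depthwise (grouped) convolutions keep the parameter overhead small, in the spirit of the abstract's efficiency claim.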