Transformer-based methods have recently achieved great progress on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens due to the greatly increased number of patches and the quadratic complexity of self-attention. How to model the 3D self-attention of video data efficiently and effectively has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of the patches along the temporal dimension with a specific mosaic pattern, thus converting a vanilla spatial self-attention operation into a spatiotemporal one with little additional cost. As a result, we can compute 3D self-attention with nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module and can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves performance competitive with the state of the art on Something-something V1 & V2, Diving-48, and Kinetics400, while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/MartinXM/TPS.
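To make the shift-then-attend idea concrete, below is a minimal sketch of a temporal patch shift followed by plain per-frame (2D) self-attention, so the spatial attention sees tokens from neighboring frames at essentially no extra cost. This is an illustrative assumption, not the authors' implementation (see the linked repository for that): the mosaic pattern, the wrap-around shift via `torch.roll`, and the toy attention module are all placeholders.

```python
import torch
import torch.nn as nn


def temporal_patch_shift(x, shifts):
    """Shift selected patch positions along the temporal axis.

    x:      (B, T, N, C) video features, N spatial patches per frame.
    shifts: length-N list of temporal offsets (e.g. a repeating mosaic of
            -1, 0, +1), one offset per patch position.
    """
    out = torch.empty_like(x)
    for n, s in enumerate(shifts):
        # torch.roll wraps around at the clip boundary; a padded version
        # would be closer to a real implementation.
        out[:, :, n, :] = torch.roll(x[:, :, n, :], shifts=s, dims=1)
    return out


class ShiftedSpatialAttention(nn.Module):
    """Per-frame self-attention applied after temporal patch shift:
    attention itself stays 2D, but its input mixes patches from
    neighboring frames."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, shifts):
        B, T, N, C = x.shape
        x = temporal_patch_shift(x, shifts)
        x = x.reshape(B * T, N, C)  # attention is computed within each frame
        x, _ = self.attn(x, x, x)
        return x.reshape(B, T, N, C)


if __name__ == "__main__":
    B, T, N, C = 2, 8, 49, 96
    x = torch.randn(B, T, N, C)
    # toy mosaic: alternate backward / zero / forward shifts across patch positions
    shifts = [(-1, 0, 1)[n % 3] for n in range(N)]
    y = ShiftedSpatialAttention(C)(x, shifts)
    print(y.shape)  # torch.Size([2, 8, 49, 96])
```

Because the shift is a pure reindexing and the attention remains per-frame over N tokens, the cost stays that of 2D self-attention rather than growing with T * N tokens.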