In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame $t$ may be entirely unrelated to what is found at that location in frame $t+k$. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers -- trajectory attention -- that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something--Something V2, and Epic-Kitchens datasets. Code and models are available at: https://github.com/facebookresearch/Motionformer
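To make the idea of aggregating along motion paths concrete, below is a minimal single-head sketch of trajectory attention in PyTorch. The tensor layout, function name, and the two-stage factorization (spatial pooling per frame to form trajectory tokens, then temporal pooling along each trajectory) are simplifying assumptions for illustration; multi-head projections, the class token, and the learned projections applied to trajectory tokens are omitted. The repository linked above contains the reference implementation.

```python
import torch

def trajectory_attention(q, k, v, num_frames):
    """Illustrative single-head sketch of trajectory attention.

    q, k, v: (B, T*S, D) queries/keys/values over T frames with S
    spatial locations each. Shapes and names are assumptions made for
    this sketch, not the reference implementation.
    """
    B, N, D = q.shape
    T = num_frames
    S = N // T
    scale = D ** -0.5

    # 1) Spatial attention of each query against every frame separately:
    #    for query location (s, t) and frame t', pool over locations s'
    #    to form a "trajectory token" tracking the query through t'.
    attn = (q @ k.transpose(-2, -1)) * scale            # (B, T*S, T*S)
    attn = attn.view(B, N, T, S).softmax(dim=-1)        # normalize over space per frame
    v = v.view(B, T, S, D)
    traj = torch.einsum('bnts,btsd->bntd', attn, v)     # (B, T*S, T, D) trajectory tokens

    # 2) Temporal attention along each trajectory: the query aggregates
    #    its own T trajectory tokens across time.
    t_attn = torch.einsum('bnd,bntd->bnt', q, traj) * scale
    t_attn = t_attn.softmax(dim=-1)
    out = torch.einsum('bnt,bntd->bnd', t_attn, traj)   # (B, T*S, D)
    return out
```

Note that, as written, the first stage still forms a full (T*S) x (T*S) attention map, which is the quadratic cost in input size that the approximation method mentioned above is designed to reduce.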