Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in various image tasks. However, how to extend CLIP with effective temporal modeling remains an open and crucial problem. Existing factorized or joint spatial-temporal modeling trades off efficiency against performance. While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already captures the essential temporal cues without temporal attention. To this end, in this paper we propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving remarkably high performance. Specifically, for each frame pair, an interactive point is predicted in each frame, serving as a region rich in mutual information. By enhancing the features around the interactive point, the two frames are implicitly aligned. The aligned features are then pooled into a single token, which is leveraged in the subsequent spatial self-attention. Our method eliminates the costly or insufficient temporal self-attention in video. Extensive experiments on standard benchmarks demonstrate the superiority and generality of our module. In particular, the proposed ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H. Code is released at https://github.com/Francis-Rings/ILA .
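To make the alignment idea concrete, below is a minimal PyTorch sketch written from the abstract alone: it predicts a per-frame score map whose softmax acts as the "interactive point", reweights (enhances) patch tokens around that point to implicitly align a frame pair, and pools the result into a single alignment token for the subsequent spatial self-attention. Module and variable names (InteractivePointAlign, point_head, proj) are assumptions for illustration, not the authors' implementation; see the official repository for the actual code.

```python
import torch
import torch.nn as nn


class InteractivePointAlign(nn.Module):
    """Sketch: align a frame pair via a predicted interactive point, then pool to one token."""

    def __init__(self, dim: int, grid_size: int = 14):
        super().__init__()
        self.grid_size = grid_size
        # Predicts one score map per frame from the concatenated frame pair;
        # its spatial softmax plays the role of the "interactive point".
        self.point_head = nn.Conv2d(2 * dim, 2, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_t: torch.Tensor, x_tp1: torch.Tensor) -> torch.Tensor:
        # x_t, x_tp1: (B, N, C) patch tokens of two adjacent frames, N = grid_size ** 2
        B, N, C = x_t.shape
        h = w = self.grid_size
        f_t = x_t.transpose(1, 2).reshape(B, C, h, w)
        f_tp1 = x_tp1.transpose(1, 2).reshape(B, C, h, w)

        # Score maps for the two frames, softmaxed over spatial positions.
        scores = self.point_head(torch.cat([f_t, f_tp1], dim=1))      # (B, 2, h, w)
        weights = scores.flatten(2).softmax(dim=-1)                    # (B, 2, N)

        # Enhance features around each interactive point by reweighting tokens,
        # which implicitly aligns the two frames.
        aligned_t = x_t * weights[:, 0].unsqueeze(-1)
        aligned_tp1 = x_tp1 * weights[:, 1].unsqueeze(-1)

        # Pool the aligned features into a single alignment token.
        align_token = self.proj((aligned_t + aligned_tp1).sum(dim=1))  # (B, C)
        return align_token


if __name__ == "__main__":
    # Usage: the alignment token would be appended to each frame's tokens so the
    # standard spatial self-attention of the CLIP ViT block can attend to it.
    B, N, C = 2, 14 * 14, 768
    ila = InteractivePointAlign(dim=C)
    token = ila(torch.randn(B, N, C), torch.randn(B, N, C))
    print(token.shape)  # torch.Size([2, 768])
```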