Long-Short Temporal Contrastive Learning of Video Transformers

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
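
The long-short matching idea can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work. It assumes that both views have the same number of frames and differ only in temporal stride, and that the matching objective is an InfoNCE-style loss of the kind used by MoCo-family methods. The names `sample_views` and `info_nce`, the default clip length, strides, and temperature are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_views(num_frames, clip_len=8, short_stride=2, long_stride=8, rng=None):
    """Return frame indices for a short view and a long view of one video.

    Assumption: both views use clip_len frames; the long view simply uses a
    larger stride, so it covers a wider temporal extent.
    """
    rng = rng or np.random.default_rng()

    def clip(stride):
        span = (clip_len - 1) * stride + 1            # frames covered by the clip
        start = rng.integers(0, num_frames - span + 1)  # random valid start frame
        return start + stride * np.arange(clip_len)

    return clip(short_stride), clip(long_stride)

def info_nce(query, key, temperature=0.1):
    """InfoNCE loss: query[i] should match key[i] against all other keys.

    query, key: arrays of shape (batch, dim), e.g. short-view and long-view
    embeddings produced by some video encoder (not modeled here).
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = q @ k.T / temperature                    # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives lie on the diagonal
```

In a training loop, the frame indices from `sample_views` would select two clips from the same video, an encoder would map each clip to an embedding, and `info_nce` would be minimized over a batch of such pairs. BYOL and SimSiam replace this loss with a prediction objective that needs no negatives, which the sketch does not cover.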
