In this paper, we propose self-supervised training for video transformers using unlabelled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, so that they remain invariant to spatiotemporal variations in actions. To the best of our knowledge, the proposed approach is the first to alleviate the dependency on negative samples or dedicated memory banks in a Self-supervised Video Transformer (SVT). Further, owing to the flexibility of transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encodings, and supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes. Code: https://git.io/J1juJ
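To make the view-matching objective concrete, the snippet below is a minimal, illustrative sketch only, not the authors' implementation: the tiny 3D-conv encoder, the tensor shapes, and the specific loss form (cross-entropy between softened teacher and student distributions, in the style of self-distillation) are assumptions chosen to show how a local view and a global view of the same video can be matched without negative samples or a memory bank.

```python
# Hypothetical sketch of matching local/global spatiotemporal views of one video.
# Names (TinyVideoEncoder, match_loss, tau_s, tau_t) are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoEncoder(nn.Module):
    """Stand-in for a video transformer; maps a clip (B, C, T, H, W) to a feature vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv3d(3, 32, kernel_size=3, padding=1)
        self.head = nn.Linear(32, dim)

    def forward(self, x):
        h = F.relu(self.conv(x))        # (B, 32, T, H, W)
        h = h.mean(dim=(2, 3, 4))       # global spatiotemporal pooling -> (B, 32)
        return self.head(h)             # (B, dim)


def match_loss(student_feat, teacher_feat, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student view distributions; no negatives needed."""
    target = F.softmax(teacher_feat / tau_t, dim=-1).detach()  # stop-gradient on the teacher
    log_pred = F.log_softmax(student_feat / tau_s, dim=-1)
    return -(target * log_pred).sum(dim=-1).mean()


student, teacher = TinyVideoEncoder(), TinyVideoEncoder()
teacher.load_state_dict(student.state_dict())  # teacher is typically a momentum/EMA copy

# Global view: more frames, larger crop; local view: fewer frames, smaller crop.
global_view = torch.randn(2, 3, 8, 112, 112)
local_view = torch.randn(2, 3, 4, 64, 64)

# Match the student's local-view features to the teacher's global-view features.
loss = match_loss(student(local_view), teacher(global_view))
loss.backward()
print(f"view-matching loss: {loss.item():.4f}")
```

In this sketch the varying frame counts and spatial sizes of the two clips stand in for the varying frame rates and crop sizes of the local and global views; because the encoder pools over space and time, both views map to feature vectors of the same dimension and can be compared directly.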