We propose SCVRL, a novel contrastive framework for self-supervised video representation learning. Unlike previous contrastive learning methods, which mostly focus on learning visual semantics (e.g., CVRL), SCVRL is capable of learning both semantic and motion patterns. To this end, we reformulate the popular shuffling pretext task within a modern contrastive learning paradigm. We show that our transformer-based network has a natural capacity to learn motion in a self-supervised setting and achieves strong performance, outperforming CVRL on four benchmarks.
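One plausible way to fold a shuffling pretext task into a contrastive objective is to treat frame-shuffled clips as additional negatives in an InfoNCE-style loss, so that the encoder cannot solve the task from appearance alone. The sketch below illustrates this idea only; it is not the authors' implementation, the treatment of shuffled clips as extra negatives is an assumption, and the names (`shuffled_contrastive_loss`, `z_a`, `z_b`, `z_shuf`) are illustrative placeholders.

```python
# Minimal sketch of a shuffled-contrastive (InfoNCE-style) loss, assuming
# frame-shuffled clips act as extra negatives. Not the paper's exact code.
import torch
import torch.nn.functional as F

def shuffled_contrastive_loss(z_a, z_b, z_shuf, temperature=0.1):
    """z_a, z_b : (N, D) embeddings of two temporally ordered clips per video.
    z_shuf     : (N, D) embeddings of frame-shuffled clips (extra negatives)."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    z_shuf = F.normalize(z_shuf, dim=1)

    # Positive logits: ordered clips from the same video should agree.
    pos = (z_a * z_b).sum(dim=1, keepdim=True) / temperature   # (N, 1)

    # Negatives: other videos' clips plus every shuffled clip, so temporal
    # order must be encoded to separate positives from shuffled negatives.
    neg_other = z_a @ z_b.t() / temperature                     # (N, N)
    neg_other.fill_diagonal_(float('-inf'))                     # mask positives
    neg_shuf = z_a @ z_shuf.t() / temperature                   # (N, N)

    logits = torch.cat([pos, neg_other, neg_shuf], dim=1)
    labels = torch.zeros(z_a.size(0), dtype=torch.long, device=z_a.device)
    return F.cross_entropy(logits, labels)
```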