Contrastive learning has revolutionized self-supervised image representation learning and has recently been adapted to the video domain. One of its greatest advantages is that it allows flexible definition of powerful loss objectives, as long as there is a reasonable way to formulate positive and negative samples to contrast. However, existing approaches rely heavily on short-range spatiotemporal salience to form clip-level contrastive signals, which limits their ability to use global context. In this paper, we propose a new video-level contrastive learning method that uses segments to formulate positive pairs. Our formulation captures the global context of a video and is therefore robust to temporal content change. We also incorporate a temporal order regularization term to enforce the inherent sequential structure of videos. Extensive experiments show that our video-level contrastive learning framework (VCLR) outperforms previous state-of-the-art methods on five video datasets for downstream action classification, action localization, and video retrieval. Code is available at https://github.com/amazon-research/video-contrastive-learning.
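To make the segment-based positive-pair idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: the helper names (`sample_segment_indices`, `info_nce`), the linear stand-in encoder, the temperature value, and the use of precomputed frame features are all illustrative assumptions. Two views of each video are drawn by independently sampling one frame per temporal segment, so each view spans the full video rather than a short clip, and the two views are then contrasted with a standard InfoNCE loss.

```python
# Minimal sketch (not the authors' exact implementation) of segment-based
# positive-pair formation for video-level contrastive learning.
import torch
import torch.nn.functional as F

def sample_segment_indices(num_frames: int, num_segments: int) -> torch.Tensor:
    """Split the video into equal segments and draw one random frame index
    from each, so a single view spans the whole video (global context)."""
    seg_len = num_frames // num_segments
    offsets = torch.arange(num_segments) * seg_len
    return offsets + torch.randint(0, seg_len, (num_segments,))

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE: matched rows of z1/z2 are positives, others negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: two independently segment-sampled "views" per video in a batch.
# A real model would also apply spatial augmentation and run frames through a
# video backbone; here a linear layer stands in for the encoder.
B, T, D = 4, 64, 128                           # batch, frames, feature dim
videos = torch.randn(B, T, D)                  # assumed precomputed frame features
encoder = torch.nn.Linear(D, 64)

views = []
for _ in range(2):                             # two independent samplings
    idx = torch.stack([sample_segment_indices(T, 8) for _ in range(B)])
    frames = torch.gather(videos, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    views.append(encoder(frames.mean(dim=1)))  # video-level embedding

loss = info_nce(views[0], views[1])
print(loss.item())
```

Because both views are sampled from segments distributed across the entire video, the positive pair encodes video-level rather than clip-level agreement; the paper's temporal order regularization (predicting whether sampled clips are in chronological order) would be added on top of this objective.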