We study self-supervised video representation learning, a challenging task due to 1) the lack of labels for explicit supervision and 2) unstructured and noisy visual information. Existing methods mainly use a contrastive loss with video clips as instances and learn visual representations by discriminating instances from each other, but they require careful treatment of negative pairs, relying on large batch sizes, memory banks, extra modalities, or customized mining strategies, which inevitably introduces noisy data. In this paper, we observe that consistency between positive samples is the key to learning robust video representations. Specifically, we propose two tasks that learn appearance consistency and speed consistency, respectively. The appearance consistency task maximizes the similarity between two clips of the same video played at different speeds. The speed consistency task maximizes the similarity between two clips with the same playback speed but different appearance information. We show that jointly optimizing the two tasks consistently improves performance on downstream tasks, e.g., action recognition and video retrieval. Remarkably, for action recognition on the UCF-101 dataset, we achieve 90.8\% accuracy without using any extra modalities or negative pairs during unsupervised pretraining, outperforming the ImageNet supervised pretrained model. Code and models will be made available.
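To make the two objectives concrete, below is a minimal PyTorch-style sketch of the joint loss. The encoder, projection heads, clip shapes, and variable names are illustrative assumptions for exposition, not the paper's actual architecture; the key point is that both tasks only maximize agreement between positive pairs, so no negative pairs are needed.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

def consistency_loss(z1, z2):
    # Negative cosine similarity between L2-normalized embeddings;
    # minimizing it maximizes agreement between two positive views.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    return -(z1 * z2).sum(dim=-1).mean()

# Hypothetical video encoder and task-specific projection heads.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
head_app = nn.Linear(128, 64)  # head for the appearance consistency task
head_spd = nn.Linear(128, 64)  # head for the speed consistency task

# (batch, channels, frames, height, width); random tensors stand in
# for sampled clips.
clip_slow  = torch.randn(4, 3, 8, 32, 32)  # a video at normal speed
clip_fast  = torch.randn(4, 3, 8, 32, 32)  # SAME video, faster speed
other_fast = torch.randn(4, 3, 8, 32, 32)  # DIFFERENT clip, same speed

# Appearance consistency: same video, different playback speeds.
l_app = consistency_loss(head_app(encoder(clip_slow)),
                         head_app(encoder(clip_fast)))
# Speed consistency: same playback speed, different appearance.
l_spd = consistency_loss(head_spd(encoder(clip_fast)),
                         head_spd(encoder(other_fast)))
loss = l_app + l_spd  # joint objective; no negative pairs involved
\end{verbatim}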