Recent works have advanced the performance of self-supervised representation learning by a large margin. At the core of these methods is intra-image invariance learning: two different transformations of one image instance are treated as a positive sample pair, and various tasks are designed to learn invariant representations by comparing the pair. Analogously, for video data, representations of frames from the same video are trained to be closer than frames from other videos, i.e. intra-video invariance. However, cross-video relations have barely been explored for visual representation learning. Unlike intra-video invariance, ground-truth labels of cross-video relations are usually unavailable without human labor. In this paper, we propose a novel contrastive learning method that explores cross-video relations by using cycle-consistency for general image representation learning. This allows us to collect positive sample pairs across different video instances, which we hypothesize will lead to higher-level semantics. We validate our method by transferring our image representation to multiple downstream tasks, including visual object tracking, image classification, and action recognition. We show significant improvement over state-of-the-art contrastive learning methods. Project page is available at https://happywu.github.io/cycle_contrast_video.
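As a rough illustration of the cycle-consistency idea described above (not the paper's exact formulation), the sketch below mines cross-video positive pairs by a forward-backward nearest-neighbor cycle between two videos; the `encoder`, the hard nearest-neighbor choice, and the function interface are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def cycle_consistent_positives(frames_a, frames_b, encoder):
    """Hypothetical sketch: find frames in video B that form a consistent
    cycle with frames in video A, to be used as cross-video positives."""
    # Encode and L2-normalize frame features from both videos.
    feat_a = F.normalize(encoder(frames_a), dim=-1)  # (N, D)
    feat_b = F.normalize(encoder(frames_b), dim=-1)  # (M, D)

    # Forward step: each frame of A finds its nearest neighbor in B.
    sim_ab = feat_a @ feat_b.t()            # (N, M) cosine similarities
    nn_in_b = sim_ab.argmax(dim=1)          # index in B for each A frame

    # Backward step: those B frames hop back to their nearest neighbor in A.
    sim_ba = feat_b[nn_in_b] @ feat_a.t()   # (N, N)
    back_in_a = sim_ba.argmax(dim=1)

    # A cycle is consistent if a frame returns to itself; the matched
    # B frame can then serve as an extra positive in a contrastive loss.
    consistent = back_in_a == torch.arange(feat_a.size(0), device=feat_a.device)
    return nn_in_b[consistent], consistent
```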