We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos. Existing approaches ignore the specifics of input distortions, e.g., by learning invariance to temporal transformations. Instead, we argue that video representations should preserve video dynamics and reflect temporal manipulations of the input. Therefore, we exploit novel constraints to build representations that are equivariant to temporal transformations and better capture video dynamics. In our method, relative temporal transformations between augmented clips of a video are encoded in a vector and contrasted with other transformation vectors. To support temporal equivariance learning, we additionally propose the self-supervised classification of two clips of a video into (1) overlapping, (2) ordered, or (3) unordered. Our experiments show that time-equivariant representations achieve state-of-the-art results on video retrieval and action recognition benchmarks on UCF101, HMDB51, and Diving48.
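The following is a minimal, illustrative sketch (not the authors' implementation) of the two self-supervised objectives summarized above: encoding the relative temporal transformation between two clips of a video into a vector that is contrasted against transformation vectors from other videos, and a three-way classification of a clip pair as overlapping, ordered, or unordered. All module names, feature dimensions, and the exact loss formulation are assumptions; a generic clip encoder is stood in for by random features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeEquivariantHeads(nn.Module):
    """Hypothetical heads on top of a clip encoder (assumed, not from the paper)."""

    def __init__(self, feat_dim=512, trans_dim=128):
        super().__init__()
        # Maps a pair of clip features to a "relative temporal transformation" vector.
        self.transform_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, trans_dim),
        )
        # Classifies the temporal relation between the two clips:
        # 0 = overlapping, 1 = ordered, 2 = unordered.
        self.relation_head = nn.Linear(2 * feat_dim, 3)

    def forward(self, z1, z2):
        pair = torch.cat([z1, z2], dim=1)
        t = F.normalize(self.transform_head(pair), dim=1)  # transformation vector
        logits = self.relation_head(pair)                   # 3-way relation logits
        return t, logits


def transformation_contrastive_loss(t_a, t_b, temperature=0.1):
    """InfoNCE-style loss (an assumed formulation): the transformation vector
    from one augmented view of a clip pair (t_a) should match the vector from
    another view of the same pair (t_b) and differ from those of other videos
    in the batch."""
    logits = t_a @ t_b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(t_a.size(0), device=t_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, feat_dim = 8, 512
    heads = TimeEquivariantHeads(feat_dim)
    # Random tensors stand in for encoder outputs of two clips under two augmentations.
    z1, z2 = torch.randn(B, feat_dim), torch.randn(B, feat_dim)
    z1b, z2b = torch.randn(B, feat_dim), torch.randn(B, feat_dim)
    relation_labels = torch.randint(0, 3, (B,))  # overlapping / ordered / unordered

    t_a, logits_a = heads(z1, z2)
    t_b, _ = heads(z1b, z2b)
    loss = transformation_contrastive_loss(t_a, t_b) + F.cross_entropy(logits_a, relation_labels)
    loss.backward()
```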