We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (e.g., fine-grained video action recognition, which would benefit from temporal information). To overcome this limitation, we propose an 'augmentation-aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. We show that representations learned by our method encode valuable information about the specified spatial or temporal augmentations, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
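The sketch below illustrates, under stated assumptions, one way an augmentation-aware projection could be wired up: an encoding of the augmentation parameters (e.g., the temporal shift between two clips) is concatenated with the video feature before the contrastive projection, so the backbone is not forced to be invariant to that shift. This is a minimal PyTorch illustration, not the authors' released implementation; names such as `AugmentationAwareProjector`, `aug_params`, and the feature dimensions are hypothetical.

```python
# Minimal sketch of an augmentation-aware contrastive projection (assumed design,
# not the official CATE code). All module and variable names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentationAwareProjector(nn.Module):
    """Projects a video embedding conditioned on an encoding of the
    augmentation parameters (e.g., the time shift used to create the view)."""
    def __init__(self, feat_dim=2048, aug_dim=8, proj_dim=128, hidden=512):
        super().__init__()
        # Encode the raw augmentation parameterisation (hypothetical format).
        self.aug_encoder = nn.Sequential(nn.Linear(aug_dim, hidden), nn.ReLU())
        # Projection MLP sees both the video feature and the augmentation encoding.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, proj_dim),
        )

    def forward(self, video_feat, aug_params):
        a = self.aug_encoder(aug_params)               # (B, hidden)
        z = self.proj(torch.cat([video_feat, a], -1))  # (B, proj_dim)
        return F.normalize(z, dim=-1)

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE loss: matching batch indices are positives,
    all other instances in the batch serve as negatives."""
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Toy usage: two clips per video; the encoded shift between them is passed to
# the projector so the representation need not discard temporal information.
B, feat_dim, aug_dim = 4, 2048, 8
projector = AugmentationAwareProjector(feat_dim, aug_dim)
feat_a, feat_b = torch.randn(B, feat_dim), torch.randn(B, feat_dim)  # backbone outputs
aug_a, aug_b = torch.randn(B, aug_dim), torch.randn(B, aug_dim)      # encoded augmentation params
loss = info_nce(projector(feat_a, aug_a), projector(feat_b, aug_b))
```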