This paper introduces a novel method for self-supervised video representation learning via feature prediction. In contrast to previous methods that focus on predicting future features, we argue that a supervisory signal arising from unobserved past frames is complementary to one originating from future frames. The rationale behind our method is to encourage the network to explore the temporal structure of videos by distinguishing between the future and the past given present observations. We train our model in a contrastive learning framework, where jointly encoding the future and the past provides a comprehensive set of temporal hard negatives via swapping. We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition, outperforming independent prediction of the future and the past.
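To make the swapping mechanism concrete, the following is a minimal NumPy sketch of one plausible instantiation of the contrastive objective: the correct-order joint code [future; past] of each clip serves as the positive for its present-clip anchor, the swapped code [past; future] of every clip serves as a temporal hard negative, and the other clips' correct-order codes act as in-batch negatives. The function name swap_infonce, the temperature value, and the assumption that the present embedding is projected to the width of the concatenated joint code are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """Normalize rows of x to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def swap_infonce(z_present, z_future, z_past, tau=0.1):
    """InfoNCE-style loss with swapped temporal hard negatives.

    z_present: (B, 2D) anchor embeddings of the observed present clips
               (assumed projected to match the joint-code width).
    z_future:  (B, D) embeddings of the unobserved future segments.
    z_past:    (B, D) embeddings of the unobserved past segments.
    """
    B = z_present.shape[0]
    pos = l2norm(np.concatenate([z_future, z_past], axis=1))  # correct order
    neg = l2norm(np.concatenate([z_past, z_future], axis=1))  # swapped: hard negatives
    anchor = l2norm(z_present)

    # Columns 0..B-1: similarities to correct-order joint codes (the diagonal
    # entries are the positives, the off-diagonal ones are batch negatives).
    # Columns B..2B-1: similarities to swapped codes (temporal hard negatives).
    logits = np.concatenate([anchor @ pos.T, anchor @ neg.T], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(B), np.arange(B)].mean()

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
B, D = 4, 8
loss = swap_infonce(rng.normal(size=(B, 2 * D)),
                    rng.normal(size=(B, D)),
                    rng.normal(size=(B, D)))
print(f"loss: {loss:.4f}")
```

Because the swapped code of the anchor's own clip contains exactly the right content in the wrong temporal order, it cannot be rejected on appearance alone, which is what pressures the network to encode temporal structure.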