A key challenge in self-supervised video representation learning is how to effectively capture motion information in addition to context bias. While most existing works achieve this implicitly through video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces), we develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task. Specifically, we take the keyframes and motion vectors in compressed videos (e.g., in H.264 format) as the supervision sources for context and motion, respectively, which can be efficiently extracted at over 500 fps on the CPU. We then design two pretext tasks that are jointly optimized: a context matching task, where a pairwise contrastive loss is applied between video clip and keyframe features; and a motion prediction task, where clip features, passed through an encoder-decoder network, are used to estimate motion features in the near future. These two tasks use a shared video backbone and separate MLP heads. Experiments show that our approach improves the quality of the learned video representation over previous works, with absolute gains of 16.0% and 11.1% in video retrieval recall on UCF101 and HMDB51, respectively. Moreover, we find motion prediction to be a strong regularizer for video networks: using it as an auxiliary task improves the accuracy of action recognition by a margin of 7.4%–13.8%.
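As a rough illustration of the two jointly optimized pretext tasks (shared backbone, separate heads), the sketch below shows one possible training step. All module names, dimensions, the temperature, and the choice of an MSE objective for motion prediction are assumptions for illustration, not the authors' implementation; the keyframe and motion-vector targets are assumed to be precomputed embeddings.

```python
# Minimal sketch of the two pretext tasks: context matching (contrastive) and
# motion prediction (regression to future motion-vector embeddings).
# Hypothetical names/dimensions; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPHead(nn.Module):
    """Simple projection head: backbone feature -> embedding."""
    def __init__(self, in_dim, out_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


class ContextMotionPretext(nn.Module):
    def __init__(self, video_backbone, feat_dim=2048, embed_dim=128, tau=0.1):
        super().__init__()
        self.backbone = video_backbone                    # shared video backbone
        self.context_head = MLPHead(feat_dim, embed_dim)  # head for context matching
        self.motion_head = MLPHead(feat_dim, embed_dim)   # head standing in for the encoder-decoder
        self.tau = tau                                    # contrastive temperature (assumed)

    def forward(self, clips, keyframe_feats, future_motion_feats):
        # clips: (B, C, T, H, W); keyframe_feats / future_motion_feats: (B, embed_dim)
        v = self.backbone(clips)                          # (B, feat_dim)

        # Context matching: InfoNCE-style pairwise contrastive loss between
        # clip embeddings and keyframe embeddings of the same video.
        zc = F.normalize(self.context_head(v), dim=1)
        zk = F.normalize(keyframe_feats, dim=1)
        logits = zc @ zk.t() / self.tau                   # (B, B) similarity matrix
        targets = torch.arange(zc.size(0), device=zc.device)
        loss_context = F.cross_entropy(logits, targets)

        # Motion prediction: estimate embeddings of near-future motion vectors
        # from the clip features (MSE used here as a placeholder objective).
        zm = self.motion_head(v)
        loss_motion = F.mse_loss(zm, future_motion_feats)

        return loss_context + loss_motion
```

In this sketch the two losses are simply summed; in practice their weighting, the exact contrastive formulation, and the encoder-decoder used for motion prediction follow the paper's design rather than this simplified stand-in.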