Self-supervised Video Representation Learning by Context and Motion Decoupling

A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias. While most existing works implicitly achieve this with video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces), we develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task. Specifically, we take the keyframes and motion vectors in compressed videos (e.g., in H.264 format) as the supervision sources for context and motion, respectively, which can be efficiently extracted at over 500 fps on the CPU. Then we design two pretext tasks that are jointly optimized: a context matching task where a pairwise contrastive loss is cast between video clip and keyframe features; and a motion prediction task where clip features, passed through an encoder-decoder network, are used to estimate motion features in a near future. These two tasks use a shared video backbone and separate MLP heads. Experiments show that our approach improves the quality of the learned video representation over previous works, where we obtain absolute gains of 16.0% and 11.1% in video retrieval recall on UCF101 and HMDB51, respectively. Moreover, we find the motion prediction to be a strong regularization for video networks, where using it as an auxiliary task improves the accuracy of action recognition with a margin of 7.4%~13.8%.
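The context matching task described above can be illustrated with a small, hedged sketch. The abstract only states that a pairwise contrastive loss is cast between video clip and keyframe features; the InfoNCE form, the temperature value, the use of other keyframes in the batch as negatives, and the function name `context_matching_loss` are all assumptions made here for illustration, not details confirmed by the paper.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_matching_loss(clip_feats, keyframe_feats, temperature=0.1):
    """InfoNCE-style loss (assumed form): clip i should match keyframe i
    and be dissimilar to every other keyframe in the batch."""
    clips = [l2_normalize(v) for v in clip_feats]
    keys = [l2_normalize(v) for v in keyframe_feats]
    total = 0.0
    for i, clip in enumerate(clips):
        # Cosine similarities between this clip and all keyframes.
        logits = [dot(clip, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of the matching keyframe.
        total += log_denom - logits[i]
    return total / len(clips)


# Toy batch: two clips with hypothetical 2-D features.
clip_feats = [[1.0, 0.1], [0.1, 1.0]]
aligned_keyframes = [[0.9, 0.0], [0.0, 0.8]]
swapped_keyframes = [aligned_keyframes[1], aligned_keyframes[0]]

loss_aligned = context_matching_loss(clip_feats, aligned_keyframes)
loss_swapped = context_matching_loss(clip_feats, swapped_keyframes)
```

In this toy batch the loss is small when each clip is paired with its own keyframe and large when the pairing is swapped, which is the behavior a contrastive context objective relies on. The motion prediction task is not sketched here because the abstract does not give enough detail about its encoder-decoder or its target features.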


Related content

Representation learning


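Representation learning derives vector representations from training data, which overcomes the limitations of hand-crafted methods. It generally falls into two categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of autoencoders (such as denoising autoencoders and sparse autoencoders) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, exactly inferring the latent structure of given data is almost impossible, so approximate inference strategies are used. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving framework uses matrix factorization to preserve pairwise dynamic time warping (DTW) similarities. DTW-preserving shapelets can also be learned, such that the Euclidean distance in the transformed space approximates the true DTW distance of the original data. Supervised representation learning methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two currently popular models; their goal is to maximize the distance between classes and minimize the distance within each class.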
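For readers unfamiliar with the DTW distance that the similarity-preserving methods above try to approximate, here is a minimal sketch of the classic dynamic-programming formulation. The absolute difference as the local cost and the function name `dtw_distance` are assumptions chosen for illustration; the methods mentioned above may use other local costs or constraints.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences,
    using absolute difference as the local cost (an assumption)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligns with an earlier b
                cost[i][j - 1],      # b[j-1] also aligns with an earlier a
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# The second sequence repeats one value, so warping absorbs the difference.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Unlike the Euclidean distance, DTW is defined for sequences of different lengths, which is why a learned embedding whose Euclidean distance approximates it is useful.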
