We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.
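To make the tubelet idea concrete, the following is a minimal sketch (not the authors' implementation) of how a patch cropped from one video could be pasted into another clip along a synthetic trajectory, with per-frame scaling and rotation. The function name `add_synthetic_tubelet` and all parameters are illustrative assumptions; two clips augmented with the same trajectory but different appearance would then form a positive pair for contrastive learning.

```python
# Sketch only: simulate a moving, scaling, rotating "tubelet" inside a video clip.
# Assumes numpy and OpenCV; all names and defaults are hypothetical.
import numpy as np
import cv2

def add_synthetic_tubelet(video, patch, max_step=8, scale_range=(0.7, 1.3), max_rot=30):
    """video: (T, H, W, 3) uint8 array; patch: (h, w, 3) uint8 crop from another video."""
    T, H, W, _ = video.shape
    h, w, _ = patch.shape
    out = video.copy()
    # Random starting position; the tubelet then follows a random-walk trajectory.
    y = np.random.randint(0, H - h)
    x = np.random.randint(0, W - w)
    for t in range(T):
        # Per-frame scaling and rotation introduce motion patterns
        # beyond what is present in the pretraining data.
        scale = np.random.uniform(*scale_range)
        angle = np.random.uniform(-max_rot, max_rot)
        ph, pw = max(2, int(h * scale)), max(2, int(w * scale))
        p = cv2.resize(patch, (pw, ph))
        M = cv2.getRotationMatrix2D((pw / 2, ph / 2), angle, 1.0)
        p = cv2.warpAffine(p, M, (pw, ph))
        # Step the trajectory and clamp so the tubelet stays inside the frame.
        y = int(np.clip(y + np.random.randint(-max_step, max_step + 1), 0, H - ph))
        x = int(np.clip(x + np.random.randint(-max_step, max_step + 1), 0, W - pw))
        out[t, y:y + ph, x:x + pw] = p
    return out
```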