Self-supervised representation learning is able to learn semantically meaningful features; however, much of its recent success relies on multiple crops of single images that contain very few objects. Instead of learning view-invariant representations from simple images, humans learn representations in a complex world with changing scenes by observing object movement, deformation, pose variation, and ego-motion. Motivated by this ability, we present a new self-supervised representation learning framework that can be directly deployed on a video stream of complex scenes with many moving objects. Our framework features a simple flow equivariance objective that encourages the network to predict the features of another frame by applying a flow transformation to the features of the current frame. Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images. Readout experiments on challenging semantic segmentation, instance segmentation, and object detection benchmarks show that our representations outperform those obtained from previous state-of-the-art methods, including SimCLR and BYOL.
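To make the flow equivariance objective concrete, the sketch below illustrates one plausible form of it in PyTorch: features of frame t are warped by an optical-flow field and matched against the features of a later frame. This is a minimal illustration, not the paper's implementation; the `encoder`, `warp_features`, and `flow_equivariance_loss` names, the backward-warping convention, the cosine-similarity loss, and the stop-gradient on the target are all assumptions.

```python
import torch
import torch.nn.functional as F

def warp_features(feats, flow):
    """Warp a spatial feature map with a flow field via bilinear sampling.

    feats: (B, C, H, W) features of frame t
    flow:  (B, 2, H, W) flow in pixels at feature resolution; assumed to be
           the backward flow that aligns frame t to frame t+k
    """
    B, C, H, W = feats.shape
    # Base sampling grid in pixel coordinates (x indexes width, y height)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feats.device, dtype=feats.dtype),
        torch.arange(W, device=feats.device, dtype=feats.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample
    gx = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(feats, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

def flow_equivariance_loss(encoder, frame_t, frame_tk, flow):
    """Encourage warp(encoder(frame_t), flow) to match encoder(frame_tk)."""
    z_t = encoder(frame_t)    # (B, C, H, W) spatial feature map
    z_tk = encoder(frame_tk)  # (B, C, H, W)
    z_pred = warp_features(z_t, flow)
    # Negative cosine similarity per spatial location, averaged;
    # the target branch is detached, as in BYOL-style objectives
    return -F.cosine_similarity(z_pred, z_tk.detach(), dim=1).mean()
```

In practice the flow field would come from an off-the-shelf optical-flow estimator (or be predicted jointly) and resized to the encoder's feature resolution; occluded regions, where warping is ill-defined, would typically be masked out of the loss.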