Motion, measured via optical flow, provides a powerful cue to discover and learn objects in images and videos. However, compared to using appearance, it has some blind spots, such as the fact that objects become invisible if they do not move. In this work, we propose an approach that combines the strengths of motion-based and appearance-based segmentation. Specifically, we supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns, and thus likely to correspond to objects. We apply this network in two modes. In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos. In the unsupervised image segmentation mode, the network is learned from videos and applied to segment independent still images. With this, we obtain strong empirical results in unsupervised video and image segmentation, significantly outperforming the state of the art on benchmarks such as DAVIS, sometimes with a $5\%$ IoU gap.