We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., things that move at the same speed belong together, has inspired unsupervised object discovery based on motion segmentation. However, common fate is not a reliable indicator of objectness: parts of an articulated or deformable object may not move at the same speed, whereas shadows or reflections of an object always move with it but are not part of it. Our insight is to bootstrap objectness by first learning image features from relaxed common fate and then refining them by visual appearance grouping, both within each image and statistically across images. Specifically, we first learn an image segmenter in the loop of approximating optical flow with a constant per-segment flow plus a small within-segment residual flow, and then refine it for more coherent appearance and statistical figure-ground relevance. On unsupervised video object segmentation, using only a ResNet backbone and convolutional heads, our model surpasses the state of the art by absolute gains of 7%, 9%, and 5% on DAVIS16, STv2, and FBMS59 respectively, demonstrating the effectiveness of our ideas. Our code is publicly available.
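The core supervision signal described above — fitting optical flow as a constant flow per segment and penalizing what remains — can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function name and mask representation are assumptions, and the learned residual-flow component is simplified here to the raw reconstruction residual.

```python
import numpy as np

def relaxed_common_fate_loss(flow, masks):
    """Mean squared residual between a flow field and its
    piecewise-constant (one constant flow vector per segment) fit.

    flow:  (H, W, 2) optical flow field
    masks: (K, H, W) soft segment masks, summing to 1 over K at each pixel

    Hypothetical sketch: the paper additionally learns a small
    within-segment residual flow; here the residual is just the fit error.
    """
    K = masks.shape[0]
    weights = masks.reshape(K, -1)            # (K, H*W) per-segment pixel weights
    flat = flow.reshape(-1, 2)                # (H*W, 2) flattened flow vectors
    # mask-weighted mean flow of each segment
    seg_flow = (weights @ flat) / (weights.sum(1, keepdims=True) + 1e-8)  # (K, 2)
    # reconstruct the flow field as a mask-weighted sum of segment flows
    recon = (weights.T @ seg_flow).reshape(flow.shape)
    residual = flow - recon
    return float((residual ** 2).mean())
```

When the segments exactly track regions of uniform motion, the loss vanishes; a segmenter trained to minimize it is thus pushed toward motion-coherent regions, which the paper then refines with appearance grouping.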