We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows us to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them at both the inter- and intra-video level. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance of the segmentation task with respect to similarity transformations. Our training objective admits an efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
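To make the described objective concrete, below is a minimal PyTorch sketch of this kind of anchor-based training signal. It is an illustrative reconstruction, not the authors' implementation: the function names (`grid_anchors`, `anchor_disambiguation_loss`, `equivariance_reg`), the grid `stride`, the temperature `tau`, and the choice of a horizontal flip as the similarity transform are all assumptions made for illustration.

```python
# Hypothetical sketch of an anchor-based dense objective (not the paper's code).
import torch
import torch.nn.functional as F

def grid_anchors(feats, stride=8):
    """Sample anchor vectors on a uniform spatial grid.

    feats: (B, C, H, W) dense feature maps, one map per video clip.
    Returns anchors of shape (B, N, C), N = ceil(H/stride) * ceil(W/stride).
    """
    anchors = feats[:, :, ::stride, ::stride]   # uniform grid sampling
    return anchors.flatten(2).transpose(1, 2)   # (B, N, C)

def anchor_disambiguation_loss(feats, stride=8, tau=0.1):
    """Classify every dense feature among all anchors in the batch.

    Anchors from the same clip act as intra-video negatives, anchors from
    other clips as inter-video negatives; the target for each pixel is the
    anchor covering its own grid cell.
    """
    b, c, h, w = feats.shape
    feats = F.normalize(feats, dim=1)
    anchors = grid_anchors(feats, stride)       # (B, N, C), unit-normalised
    n = anchors.shape[1]
    all_anchors = anchors.reshape(b * n, c)     # pool anchors across the batch

    # Cosine similarity of every pixel to every anchor, scaled by temperature.
    logits = torch.einsum('bchw,kc->bkhw', feats, all_anchors) / tau

    # Index of the anchor for each pixel's grid cell, offset per video.
    w_g = (w + stride - 1) // stride
    ys = torch.arange(h, device=feats.device) // stride
    xs = torch.arange(w, device=feats.device) // stride
    cell = ys[:, None] * w_g + xs[None, :]                             # (H, W)
    target = torch.arange(b, device=feats.device)[:, None, None] * n + cell
    return F.cross_entropy(logits, target)

def equivariance_reg(model, frames, feats):
    """Toy equivariance regulariser, using a horizontal flip as one example
    of a similarity transformation: features of the flipped input should
    match the flipped features of the original input."""
    feats_t = F.normalize(model(torch.flip(frames, dims=[-1])), dim=1)
    feats = torch.flip(F.normalize(feats, dim=1), dims=[-1])
    return (feats_t - feats).pow(2).mean()
```

Under these assumptions, a training step would combine the two terms, e.g. `loss = anchor_disambiguation_loss(feats) + equivariance_reg(model, frames, feats)` with `feats = model(frames)`; the regulariser is what rules out the degenerate solution of collapsing all features to a single point.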