The objective of this paper is self-supervised learning of video object segmentation (VOS). We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts that usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask-embedding scheme, so as to ensure the generic nature of the learnt representation and to avoid cluster degeneracy. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design.
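The alternating scheme in step i) can be illustrated with a minimal, self-contained sketch: cluster per-pixel features into pseudo segmentation labels, which would then supervise mask encoding/decoding in step ii). This is an illustrative toy using naive k-means on synthetic features, not the paper's actual clustering procedure; all function names and shapes here are assumptions for exposition.

```python
import numpy as np

def kmeans(features, k, iters=10, seed=0):
    """Naive k-means over per-pixel features (N, D) -> cluster ids (N,)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest cluster center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Update centers; keep the old center if a cluster is empty.
        for c in range(k):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(0)
    return labels

def pseudo_masks(frame_features, k):
    """Step i): cluster pixels of a (H, W, D) feature map into k pseudo labels."""
    h, w, d = frame_features.shape
    return kmeans(frame_features.reshape(-1, d), k).reshape(h, w)

rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 16, 8))  # toy per-pixel features for one frame
masks = pseudo_masks(feats, k=3)      # pseudo segmentation labels ex nihilo
# Step ii) (not shown) would train a mask encoder/decoder against `masks`,
# e.g. with a cross-entropy segmentation loss.
```

In the actual method, the pseudo labels come from clustering learned video representations rather than random features, and the correspondence-learning objective regularizes the features so the clusters do not collapse.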