In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal because the pixel-level features are optimized only implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two levels of optimization in a complementary way within a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and exploits the class tokens and patch tokens that are native to the Vision Transformer (ViT). Specifically, for image-level optimization, we enforce out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we call this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we call this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature level and the affinity-matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
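To make the notion of visual correspondence concrete, the following is a minimal sketch of how an affinity matrix between the patch features of two frames can be computed and used to propagate a segmentation mask. It is an illustration under stated assumptions, not the INO implementation: the function names, the cosine-similarity-plus-softmax form, and the temperature value of 0.07 are all hypothetical choices made for this example.

```python
import numpy as np


def affinity_matrix(feat_ref, feat_tgt, temperature=0.07):
    """Row-normalised similarity between target and reference patch features.

    feat_ref: (N, D) array, one feature vector per reference-frame patch.
    feat_tgt: (M, D) array, one feature vector per target-frame patch.
    Returns an (M, N) array whose rows each sum to 1.
    The temperature is an assumed value, not one taken from the paper.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    logits = tgt @ ref.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def propagate_labels(affinity, labels_ref):
    """Carry reference-frame labels to the target frame.

    affinity:   (M, N) output of affinity_matrix.
    labels_ref: (N, C) one-hot (or soft) labels of the reference patches.
    Returns (M, C) soft labels for the target patches.
    """
    return affinity @ labels_ref
```

In this sketch, each target patch receives a weighted average of the reference labels, so better pixel-level features translate directly into sharper propagated masks.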