In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal because the pixel-level features are optimized only implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two complementary levels of optimization in a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and builds on the class tokens and patch tokens that are native to the Vision Transformer (ViT). Specifically, for image-level optimization, we enforce out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we call this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we call this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature level and the affinity-matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins. Code is available at https://github.com/pansanity666/INO_VOS
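To make the notion of visual correspondence concrete, the following is a minimal sketch of how an affinity matrix between the pixel-level (patch) features of two frames can be computed and used to propagate segmentation labels from a reference frame to a target frame. It is an illustration only and is not taken from the INO code base: the function names `affinity_matrix` and `propagate_labels`, the use of cosine similarity, and the temperature value of 0.07 are all assumptions made for this example.

```python
import numpy as np


def affinity_matrix(feat_ref, feat_tgt, temperature=0.07):
    """Row-stochastic affinity from target pixels to reference pixels.

    feat_ref: (N_ref, C) features of the reference frame.
    feat_tgt: (N_tgt, C) features of the target frame.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    ref = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + eps)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + eps)
    # Cosine similarity between every target and every reference pixel.
    logits = (tgt @ ref.T) / temperature
    # Softmax over the reference pixels, shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def propagate_labels(affinity, labels_ref):
    """Transfer soft labels (N_ref, K) to the target frame, giving (N_tgt, K)."""
    return affinity @ labels_ref
```

At inference time, a pipeline of this kind would take the per-pixel argmax of the propagated labels as the predicted mask for the target frame; better features yield a sharper affinity matrix and therefore cleaner propagation.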