Learning dense visual representations without labels is an arduous task, and even more so from scene-centric data. We tackle this challenging problem by proposing a cross-view consistency objective with an online clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm operates jointly on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets, and similarly for video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC.
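To make the joint-clustering idea concrete, here is a minimal sketch of clustering the token features of two views together and splitting the assignments back per view. This is an illustrative stand-in (plain k-means over concatenated features, with hypothetical names `joint_cluster`, `z1`, `z2`), not CrOC's actual online clustering algorithm; its point is only that clustering the union of both views' features yields a shared set of centroids, so no explicit object matching between crops is needed.

```python
import numpy as np

def joint_cluster(z1, z2, k=3, iters=10, seed=0):
    """Cluster the concatenated token features of two views jointly.

    z1: (n1, d) features of view 1; z2: (n2, d) features of view 2.
    Returns per-view cluster assignments drawn from a SHARED set of
    centroids, so cluster ids are directly comparable across views.
    Illustrative k-means only, not the paper's online clustering.
    """
    z = np.concatenate([z1, z2], axis=0)  # (n1 + n2, d): both views together
    rng = np.random.default_rng(seed)
    centroids = z[rng.choice(len(z), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance of every token to every centroid
        dists = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            mask = assign == c
            if mask.any():  # skip empty clusters
                centroids[c] = z[mask].mean(axis=0)
    # Split assignments back per view; matching across crops is free
    # because the clusters were estimated on both views at once.
    return assign[: len(z1)], assign[len(z1):]
```

Because both views contribute to the same centroids, a region visible in only one crop simply forms (or joins) a cluster without needing a counterpart in the other view.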