In this paper, we tackle the problem of learning visual representations from unlabeled scene-centric data. Existing work has demonstrated the potential of exploiting the complex structure underlying scene-centric data, but it commonly relies on hand-crafted objectness priors or specialized pretext tasks to build a learning framework, which may harm generalizability. Instead, we propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning. Semantic grouping is performed by assigning pixels to a set of learnable prototypes, which adapt to each sample by attentive pooling over the feature map to form new slots. Based on the learned data-dependent slots, a contrastive objective is employed for representation learning, which enhances the discriminability of features and, in turn, facilitates grouping semantically coherent pixels together. Compared with previous efforts, by simultaneously optimizing the two coupled objectives of semantic grouping and contrastive learning, our approach bypasses the disadvantages of hand-crafted priors and is able to learn object/group-level representations from scene-centric images. Experiments show that our approach effectively decomposes complex scenes into semantic groups for feature learning and significantly benefits downstream tasks, including object detection, instance segmentation, and semantic segmentation. The code will be made publicly available.
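The two coupled steps described in the abstract, assigning pixels to prototypes and contrasting the resulting slots, can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: the function names (`assign_pixels`, `pool_slots`, `slot_contrastive_loss`), the temperature values, the cosine-similarity scoring, and the use of a plain InfoNCE loss over the slots of a single pair of views are all assumptions made for the example. The encoder, the training loop, and how prototypes are learned are left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_pixels(features, prototypes, temperature=0.1):
    """Soft-assign every pixel feature to the prototypes.

    Returns one probability vector per pixel (softmax over prototypes of the
    cosine similarity). The temperature is a hypothetical value.
    """
    protos = [l2_normalize(p) for p in prototypes]
    return [
        softmax([dot(l2_normalize(f), p) / temperature for p in protos])
        for f in features
    ]


def pool_slots(features, assignments):
    """Attentive pooling: slot k is the assignment-weighted mean of the pixels."""
    num_slots = len(assignments[0])
    dim = len(features[0])
    slots = []
    for k in range(num_slots):
        weights = [a[k] for a in assignments]
        total = sum(weights)  # strictly positive, since softmax outputs are > 0
        slots.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return slots


def slot_contrastive_loss(slots_a, slots_b, temperature=0.2):
    """InfoNCE over slots: slot k of view A should match slot k of view B.

    Slots of view B that belong to other prototypes act as negatives
    (an assumption; the abstract does not specify how negatives are chosen).
    """
    queries = [l2_normalize(s) for s in slots_a]
    keys = [l2_normalize(s) for s in slots_b]
    loss = 0.0
    for k, query in enumerate(queries):
        probs = softmax([dot(query, key) / temperature for key in keys])
        loss -= math.log(probs[k])
    return loss / len(queries)


# Toy example: two prototypes, two augmented views with three "pixels" each.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
view_b = [[1.0, 0.2], [0.2, 1.0], [0.1, 0.8]]

slots_a = pool_slots(view_a, assign_pixels(view_a, prototypes))
slots_b = pool_slots(view_b, assign_pixels(view_b, prototypes))
loss = slot_contrastive_loss(slots_a, slots_b)
```

In this toy setup the slots are data-dependent: each one is pooled from the pixels of its own view, so the same prototype yields a different slot per image. Minimizing the loss pulls matching slots of the two views together, which is the sense in which the contrastive objective and the grouping reinforce each other.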