The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that make SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.
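To make the coupling concrete, below is a minimal sketch, assuming PyTorch, of the paradigm the abstract describes: a discovery network's per-pixel features are clustered with k-means to propose segment masks without supervision, and a representation network is trained with a contrastive loss over features pooled within those masks, while the discovery network tracks the representation network as a slow-moving average. This is an illustration, not the authors' implementation; all names (kmeans_masks, mask_pool, odin_step, num_segments, ema_decay) are assumptions, and for simplicity the two augmented views are assumed to be spatially aligned so that the same masks apply to both.

import torch
import torch.nn.functional as F

def kmeans_masks(features, num_segments=8, iters=10):
    # Cluster per-pixel features (B, C, H, W) into hard segment masks (B, K, H, W).
    B, C, H, W = features.shape
    x = features.permute(0, 2, 3, 1).reshape(B, H * W, C)                # (B, HW, C)
    idx = torch.randint(0, H * W, (B, num_segments), device=x.device)    # random init
    centroids = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))  # (B, K, C)
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=-1)                # (B, HW)
        onehot = F.one_hot(assign, num_segments).float()                 # (B, HW, K)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)            # (B, K, 1)
        centroids = torch.einsum('bpk,bpc->bkc', onehot, x) / counts
    return onehot.transpose(1, 2).reshape(B, num_segments, H, W)

def mask_pool(features, masks):
    # Average per-pixel features (B, C, H, W) inside each of K masks -> (B, K, C).
    B, C, H, W = features.shape
    m = masks.reshape(B, -1, H * W)                                      # (B, K, HW)
    f = features.reshape(B, C, H * W)                                    # (B, C, HW)
    return torch.einsum('bkp,bcp->bkc', m, f) / m.sum(-1, keepdim=True).clamp(min=1)

def info_nce(z1, z2, temperature=0.1):
    # Contrastive loss between matching segment embeddings of two views.
    z1 = F.normalize(z1.reshape(-1, z1.shape[-1]), dim=-1)
    z2 = F.normalize(z2.reshape(-1, z2.shape[-1]), dim=-1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(z1), device=z1.device))

def odin_step(repr_net, disc_net, view1, view2, optimizer, ema_decay=0.996):
    # One step: discover segments with disc_net, learn to represent them with repr_net.
    with torch.no_grad():
        masks = kmeans_masks(disc_net(view1))         # unsupervised pseudo-segmentation
    z1 = mask_pool(repr_net(view1), masks)
    z2 = mask_pool(repr_net(view2), masks)            # assumes views are spatially aligned
    loss = info_nce(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                             # discovery net = slow EMA of repr net
        for p_d, p_r in zip(disc_net.parameters(), repr_net.parameters()):
            p_d.mul_(ema_decay).add_(p_r, alpha=1 - ema_decay)
    return loss.item()

Here repr_net and disc_net stand for any dense feature extractors returning (B, C, H, W) maps; the key design point reflected in the sketch is that the segmentations come from the model's own (slowly updated) features rather than from hand-crafted priors or specialized augmentations.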