Recent self-supervised computer vision methods have demonstrated performance equal to or better than that of supervised methods, opening the door for AI systems to learn visual representations from practically unlimited data. However, these methods are classification-based and thus ineffective for learning the dense feature maps required for unsupervised semantic segmentation. This work presents a method to effectively learn dense, semantically rich visual concept embeddings applicable to high-resolution images. We introduce superpixelization as a means to decompose images into a small set of visually coherent regions, allowing dense semantics to be learned efficiently by swapped prediction. The expressiveness of our dense embeddings is demonstrated by significantly improving the state-of-the-art representation quality benchmarks on COCO (+16.27 mIoU) and Cityscapes (+19.24 mIoU) for both low- and high-resolution images.
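The core idea above can be illustrated with a minimal NumPy sketch: dense features are mean-pooled per superpixel, and the resulting region embeddings from two augmented views are trained with a SwAV-style swapped-prediction objective. All names here are illustrative, and the plain softmax targets stand in for the balanced (e.g. Sinkhorn) assignments such methods typically use:

```python
import numpy as np

def superpixel_pool(features, segments, n_segments):
    """Mean-pool a dense feature map (H, W, D) over superpixels.

    segments: (H, W) integer map assigning each pixel to a superpixel id.
    Returns (n_segments, D) region embeddings.
    """
    D = features.shape[-1]
    flat_f = features.reshape(-1, D)
    flat_s = segments.reshape(-1)
    counts = np.bincount(flat_s, minlength=n_segments)
    pooled = np.zeros((n_segments, D))
    for d in range(D):
        pooled[:, d] = np.bincount(flat_s, weights=flat_f[:, d],
                                   minlength=n_segments)
    return pooled / np.maximum(counts, 1)[:, None]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swapped_prediction_loss(z1, z2, prototypes, temp=0.1):
    """Swapped prediction over region embeddings from two views.

    z1, z2: (N, D) L2-normalized region embeddings (same regions, two
    augmentations); prototypes: (K, D) learnable cluster centers.
    Each view's soft cluster assignment is predicted from the other view.
    """
    s1 = z1 @ prototypes.T          # similarity scores, view 1
    s2 = z2 @ prototypes.T          # similarity scores, view 2
    p1 = softmax(s1 / temp)         # predicted assignment, view 1
    p2 = softmax(s2 / temp)         # predicted assignment, view 2
    q1 = softmax(s1 / temp)         # target for view 2 (stand-in for Sinkhorn)
    q2 = softmax(s2 / temp)         # target for view 1 (stand-in for Sinkhorn)
    ce = lambda q, p: -np.mean(np.sum(q * np.log(p + 1e-9), axis=1))
    return 0.5 * (ce(q2, p1) + ce(q1, p2))
```

Pooling over superpixels reduces the number of embeddings per image from H×W pixels to a few hundred coherent regions, which is what makes the clustering objective tractable at high resolution.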