This work presents a self-supervised method for learning dense, semantically rich visual concept embeddings for images, inspired by methods for learning word embeddings in NLP. Our method improves on prior work by generating more expressive embeddings and by being applicable to high-resolution images. Viewing the generation of natural images as a stochastic process in which a set of latent visual concepts gives rise to observable pixel appearances, our method is formulated to learn the inverse mapping from pixels to concepts. It greatly improves the effectiveness of self-supervised learning for dense embedding maps by introducing superpixelization as a natural hierarchical step up from pixels to a small set of visually coherent regions. Additional contributions are regional contextual masking with nonuniform shapes that match visually coherent patches, and complexity-based view sampling inspired by masked language models. The enhanced expressiveness of our dense embeddings is demonstrated by significant improvements over the state of the art on representation-quality benchmarks on COCO (+12.94 mIoU, +87.6\%) and Cityscapes (+16.52 mIoU, +134.2\%). Results show favorable scaling and domain generalization properties not demonstrated by prior work.
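To make the superpixelization step concrete, the following is a minimal sketch, not the paper's implementation: it pools a dense per-pixel embedding map into one embedding per visually coherent region, using SLIC from scikit-image as an off-the-shelf superpixelizer. The function name `superpixel_pool`, the choice of mean pooling, and the parameter values are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): pool dense per-pixel
# embeddings into per-superpixel embeddings via SLIC superpixelization.
import numpy as np
from skimage.segmentation import slic

def superpixel_pool(image: np.ndarray, pixel_embeddings: np.ndarray,
                    n_segments: int = 100) -> np.ndarray:
    """Average an H x W x D pixel-embedding map over SLIC superpixels.

    Returns an (n_superpixels, D) array of region embeddings.
    """
    # Label each pixel with a superpixel id; n_segments and compactness
    # are illustrative values, not tuned settings from the paper.
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    n_sp = labels.max() + 1
    d = pixel_embeddings.shape[-1]
    flat_labels = labels.reshape(-1)
    flat_emb = pixel_embeddings.reshape(-1, d)
    # Sum embeddings per superpixel, then divide by region size (mean pool).
    sums = np.zeros((n_sp, d))
    np.add.at(sums, flat_labels, flat_emb)
    counts = np.bincount(flat_labels, minlength=n_sp)[:, None]
    return sums / np.maximum(counts, 1)
```

Under these assumptions, calling `superpixel_pool(image, emb)` on an H x W x 3 image with an H x W x D embedding map yields roughly `n_segments` region embeddings, reducing the units a self-supervised objective must relate from millions of pixels to a few hundred visually coherent regions.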