We present Region Similarity Representation Learning (ReSim), a new approach to self-supervised representation learning for localization-based tasks such as object detection and segmentation. While existing work has largely focused on learning only global representations for an entire image, ReSim learns both regional representations for localization and semantic image-level representations. ReSim operates by sliding a fixed-size window across the overlapping area between two views (e.g., image crops), aligning these areas with their corresponding convolutional feature map regions, and then maximizing the feature similarity across views. As a result, ReSim learns spatially and semantically consistent feature representations throughout the convolutional feature maps of a neural network. A shift or scale of an image region, e.g., a shift or scale of an object, produces a corresponding change in the feature maps; this allows downstream tasks to leverage these representations for localization. Through object detection, instance segmentation, and dense pose estimation experiments, we illustrate how ReSim learns representations that significantly improve localization and classification performance compared to a competitive MoCo-v2 baseline: $+2.7$ AP$^{\text{bb}}_{75}$ VOC, $+1.1$ AP$^{\text{bb}}_{75}$ COCO, and $+1.9$ AP$^{\text{mk}}$ Cityscapes. Code and pre-trained models are released at: \url{https://github.com/Tete-Xiao/ReSim}
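The sliding-window mechanism described above can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: it assumes both views' feature maps are already expressed in a shared image coordinate frame, that region features are obtained by average pooling, and that the window size, step, and feature stride are hypothetical values chosen for the example. A full ReSim loss would additionally contrast these similarities against negative regions.

```python
import numpy as np

def region_features(fmap, box, stride):
    # Map an image-space box (x0, y0, x1, y1) to feature-map
    # coordinates via the feature stride, then average-pool it.
    x0, y0, x1, y1 = [int(round(c / stride)) for c in box]
    region = fmap[y0:max(y0 + 1, y1), x0:max(x0 + 1, x1)]
    return region.mean(axis=(0, 1))

def sliding_windows(overlap, win, step):
    # Enumerate fixed-size windows inside the overlapping image area.
    ox0, oy0, ox1, oy1 = overlap
    for y in range(oy0, oy1 - win + 1, step):
        for x in range(ox0, ox1 - win + 1, step):
            yield (x, y, x + win, y + win)

def region_similarities(fmap_a, fmap_b, overlap, win=32, step=16, stride=8):
    # Cosine similarity between aligned window features from the two
    # views; a ReSim-style objective would maximize these values.
    sims = []
    for box in sliding_windows(overlap, win, step):
        fa = region_features(fmap_a, box, stride)
        fb = region_features(fmap_b, box, stride)
        sims.append(float(fa @ fb /
                          (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8)))
    return sims
```

With identical feature maps from the two views, every aligned window pair yields a similarity of 1, which is the fixed point the objective pulls toward during training.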