Most invariance-based self-supervised methods rely on single object-centric images (e.g., ImageNet images) for pretraining, learning representations invariant to geometric transformations. However, when images are not object-centric, geometric transformations such as random crops and multi-crops can significantly alter the semantics of the image. Furthermore, the model may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture that learns features sensitive to geometric transformations such as four-fold rotation, random crop, and multi-crop. Our method encourages the student to learn sensitive features by increasing the similarity between overlapping regions rather than entire views, and by applying rotations to the target feature map. Additionally, we use a patch correspondence loss to capture long-term dependencies. Our approach demonstrates improved performance when using non-object-centric images as pretraining data, compared to other methods that learn geometric-transformation-invariant representations. We surpass the DINO baseline on image classification, semantic segmentation, detection, and instance segmentation, with improvements of 6.1 $Acc$, 0.6 $mIoU$, 0.4 $AP^b$, and 0.1 $AP^m$.
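The overlap-and-rotation objective described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical simplification, not the paper's implementation: the function name `gts_overlap_loss`, the box convention `(top, left, h, w)` in feature-map coordinates, and the use of a negative cosine similarity as the matching loss are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def gts_overlap_loss(student_map, teacher_map, box_s, box_t, k_rot):
    """Hypothetical sketch of a geometric-transformation-sensitive loss.

    student_map, teacher_map: (B, C, H, W) feature maps from two augmented views.
    box_s, box_t: (top, left, h, w) of the overlapping region in each map.
    k_rot: number of 90-degree rotations applied to the student's view.
    """
    ts, ls, hs, ws = box_s
    tt, lt, ht, wt = box_t
    # Compare only the region shared by the two crops, not the entire views.
    s = student_map[:, :, ts:ts + hs, ls:ls + ws]
    t = teacher_map[:, :, tt:tt + ht, lt:lt + wt]
    # Apply the student view's four-fold rotation to the teacher target, so
    # matching requires rotation-sensitive (not rotation-invariant) features.
    t = torch.rot90(t, k=k_rot, dims=(2, 3))
    # Resize the teacher patch to the student's grid if the sizes differ.
    if s.shape[2:] != t.shape[2:]:
        t = F.interpolate(t, size=s.shape[2:], mode="bilinear",
                          align_corners=False)
    # Negative per-location cosine similarity over the overlap, averaged.
    return -F.cosine_similarity(s, t.detach(), dim=1).mean()
```

Because the loss is computed per spatial location inside the overlap, identical features at corresponding positions drive it toward its minimum of -1, while a representation that ignored position or rotation would not.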