Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called VICRegL is proposed that learns good global and local features simultaneously, yielding excellent performance on detection and segmentation tasks while maintaining good performance on classification tasks. Concretely, two identical branches of a standard convolutional net architecture are fed two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors. Simultaneously, the VICReg criterion is applied to pairs of local feature vectors occurring before the last pooling layer. Two local feature vectors are attracted to each other if their l2-distance is below a threshold or if their relative locations are consistent with a known geometric transformation between the two input images. We demonstrate strong performance on linear classification and segmentation transfer tasks. Code and pretrained models are publicly available at: https://github.com/facebookresearch/VICRegL
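The procedure described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names (`vicreg_loss`, `match_nearest`, `vicregl_loss`), the loss weights, the thresholds, and the mixing weight `alpha` are all assumptions introduced for illustration. The sketch also assumes that local feature positions have already been mapped back into the coordinate frame of the original image using the known geometric transformation, and that each matching step yields at least two pairs.

```python
import numpy as np

def vicreg_loss(x, y, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style criterion on two (n, d) arrays of paired embeddings.
    The weights are illustrative defaults, not values taken from the abstract."""
    n, d = x.shape
    if n < 2:
        raise ValueError("need at least two pairs to estimate variance/covariance")
    # Invariance: mean squared difference between paired embeddings.
    inv = np.mean((x - y) ** 2)

    def var_term(z):
        # Variance: hinge that pushes each dimension's std above 1.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def cov_term(z):
        # Covariance: penalise squared off-diagonal covariance entries.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d

    return (inv_w * inv
            + var_w * (var_term(x) + var_term(y))
            + cov_w * (cov_term(x) + cov_term(y)))

def match_nearest(a, b, threshold):
    """For each row of a, find its l2-nearest row of b and keep the pair
    only if the distance is below `threshold`. Returns index arrays (ia, ib)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (na, nb)
    nn = dists.argmin(axis=1)
    keep = dists[np.arange(len(a)), nn] < threshold
    return np.flatnonzero(keep), nn[keep]

def vicregl_loss(g1, g2, f1, f2, p1, p2, feat_thr, loc_thr, alpha=0.75):
    """Combine a global and a local VICReg-style term (hypothetical weighting).
    g1, g2: (batch, d) global features of the two distorted views.
    f1, f2: (m, d) local features of one image pair, before the last pooling.
    p1, p2: (m, 2) local feature positions in original-image coordinates."""
    global_term = vicreg_loss(g1, g2)
    # Pairs whose feature vectors are close in l2 distance.
    ia, ib = match_nearest(f1, f2, feat_thr)
    # Pairs whose locations agree under the known transformation.
    ja, jb = match_nearest(p1, p2, loc_thr)
    local_term = vicreg_loss(f1[ia], f2[ib]) + vicreg_loss(f1[ja], f2[jb])
    return alpha * global_term + (1.0 - alpha) * local_term
```

Using one nearest-neighbour routine for both feature-based and location-based matching keeps the two matching rules symmetric. A full implementation would operate on batched feature maps and differentiable tensors rather than plain NumPy arrays.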