Visual place recognition (VPR) is a fundamental task of computer vision for visual localization. Existing methods are trained using image pairs that either depict the same place or not. Such a binary indication does not consider the continuous relations of similarity between images of the same place taken from different positions, determined by the continuous nature of camera pose. The binary similarity induces a noisy supervision signal into the training of VPR methods, which stall in local minima and require expensive hard-pair mining algorithms to guarantee convergence. Motivated by the fact that two images of the same place only partially share visual cues due to camera pose differences, we deploy an automatic re-annotation strategy to re-label VPR datasets. We compute graded similarity labels for image pairs based on available localization metadata. Furthermore, we propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks. We demonstrate that the use of the new labels and the GCL allows us to dispense with hard-pair mining and to train image descriptors that perform better in VPR by nearest-neighbor search, obtaining results superior or comparable to those of methods that require expensive hard-pair mining and re-ranking techniques. Code and models are available at: https://github.com/marialeyvallina/generalized_contrastive_loss
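The abstract does not state the GCL in closed form. Below is a minimal sketch of one plausible formulation, in which the binary label of the classical contrastive loss is replaced by a graded similarity label ψ ∈ [0, 1]; the function name, the margin value, and the assumption of L2-normalized descriptors are illustrative choices, not the authors' confirmed implementation.

```python
import torch
import torch.nn.functional as F

def generalized_contrastive_loss(f_a, f_b, psi, margin=0.5):
    """Sketch of a graded-similarity contrastive loss (GCL-style).

    f_a, f_b : (B, D) descriptor batches from the two branches of a
               Siamese network (assumed L2-normalized).
    psi      : (B,) graded similarity labels in [0, 1], computed from
               localization metadata (1 = same place and pose,
               0 = different place).
    margin   : hinge margin for the dissimilar term (hypothetical value).
    """
    d = F.pairwise_distance(f_a, f_b)                 # Euclidean distance per pair
    attract = psi * d.pow(2)                          # pull term, weighted by similarity
    repel = (1.0 - psi) * F.relu(margin - d).pow(2)   # push term, weighted by dissimilarity
    return 0.5 * (attract + repel).mean()
```

When ψ is restricted to {0, 1}, this reduces to the standard binary contrastive loss; the graded weighting lets partially overlapping views contribute a proportionally weaker attraction, which is consistent with the abstract's claim that hard-pair mining becomes unnecessary.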