Visual place recognition (VPR) using deep networks has achieved state-of-the-art performance. However, most such methods require a training set with ground-truth sensor poses to obtain positive and negative samples of each observation's spatial neighborhood for supervised learning. When such information is unavailable, temporal neighborhoods from a sequentially collected data stream can be exploited for self-supervised training, although we find the resulting performance suboptimal. Inspired by noisy-label learning, we propose a novel self-supervised framework named \textit{TF-VPR} that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods. Our method follows an iterative training paradigm that alternates between (1) representation learning with data augmentation, (2) positive set expansion to include current feature-space neighbors, and (3) positive set contraction via geometric verification. We conduct comprehensive experiments on both simulated and real datasets, with either RGB images or point clouds as inputs. The results show that our method outperforms our baselines in recall rate, robustness, and heading diversity, a novel metric we propose for VPR. Our code and datasets can be found at https://ai4ce.github.io/TF-VPR/.
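To make the expand/contract paradigm concrete, below is a minimal, runnable Python sketch of one way such an iterative loop could be organized. It is an illustration under stated assumptions, not the authors' implementation: `embed` stands in for the learned encoder (here a fresh random projection per iteration, where the real method retrains a deep network on the current positive sets), and `verify_geometry` is a stub predicate standing in for the geometric verification step. All function names are hypothetical.

```python
"""Hypothetical sketch of the iterative TF-VPR training paradigm:
seed positives from temporal neighbors, then alternately expand with
feature-space neighbors and contract via geometric verification."""
import numpy as np

rng = np.random.default_rng(0)

def embed(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the learned encoder: a random projection drawn anew
    each iteration, loosely mimicking an evolving representation."""
    proj = rng.standard_normal((frames.shape[1], 16))
    return frames @ proj

def temporal_positives(n_frames: int, window: int) -> list[set[int]]:
    """Seed each frame's positive set with its temporal neighbors."""
    return [{j for j in range(max(0, i - window),
                              min(n_frames, i + window + 1)) if j != i}
            for i in range(n_frames)]

def feature_neighbors(feats: np.ndarray, i: int, k: int) -> set[int]:
    """k nearest neighbors of frame i in the current feature space."""
    dists = np.linalg.norm(feats - feats[i], axis=1)
    return {int(j) for j in np.argsort(dists)[1:k + 1]}  # skip i itself

def verify_geometry(frame_a: np.ndarray, frame_b: np.ndarray) -> bool:
    """Stub: a real system would match local features between the two
    observations and check pose consistency."""
    return float(frame_a @ frame_b) > 0.0

def tfvpr_loop(frames: np.ndarray, n_iters: int = 5,
               window: int = 2, k: int = 5) -> list[set[int]]:
    positives = temporal_positives(len(frames), window)
    for _ in range(n_iters):
        # (1) representation learning (stubbed; a real system trains a
        #     network with data augmentation using `positives`)
        feats = embed(frames)
        for i in range(len(frames)):
            # (2) expansion: add current feature-space neighbors
            positives[i] |= feature_neighbors(feats, i, k)
            # (3) contraction: drop pairs failing geometric verification
            positives[i] = {j for j in positives[i]
                            if verify_geometry(frames[i], frames[j])}
    return positives

frames = rng.standard_normal((100, 64))  # toy "observations"
print(sorted(tfvpr_loop(frames)[0]))
```

The key design point the sketch tries to capture is that the positive sets are treated as noisy labels: expansion lets the evolving feature space propose new spatial neighbors beyond the temporal seed, while contraction uses a learning-free geometric check to prune false positives before the next training round.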