Visual place recognition is a challenging task for applications such as autonomous driving navigation and mobile robot localization. Distracting elements present in complex scenes often lead to deviations in the perception of visual place. To address this problem, it is crucial to integrate information from only task-relevant regions into image representations. In this paper, we introduce a novel holistic place recognition model, TransVPR, based on vision Transformers. It benefits from the desirable property of the self-attention operation in Transformers, which can naturally aggregate task-relevant features. Attention maps from multiple levels of the Transformer, which focus on different regions of interest, are further combined to generate a global image representation. In addition, the output tokens from Transformer layers, filtered by the fused attention mask, are treated as key-patch descriptors and used to perform spatial matching to re-rank the candidates retrieved by the global image features. The whole model allows end-to-end training with a single objective and image-level supervision. TransVPR achieves state-of-the-art performance on several real-world benchmarks while maintaining low computational time and storage requirements.
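To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the attention-fusion idea described above: per-level attention maps are averaged into a fused mask, which both pools patch tokens into a global descriptor and selects key-patch descriptors for later spatial matching. The fusion rule (mean of softmax maps), the learnable projection heads, the use of final-layer tokens, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttentionPooling(nn.Module):
    """Sketch of multi-level attention fusion for place recognition.

    Assumptions (not from the paper): each level's attention map comes
    from a learned single-channel projection of its tokens, maps are
    fused by averaging, and key patches are the top-k tokens under the
    fused mask.
    """

    def __init__(self, dim: int = 256, num_levels: int = 3, top_k: int = 100):
        super().__init__()
        # One linear head per Transformer level, producing attention logits.
        self.attn_heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_levels))
        self.top_k = top_k

    def forward(self, level_tokens):
        # level_tokens: list of (B, N, D) patch-token tensors taken from
        # `num_levels` Transformer layers (class token already removed).
        masks = []
        for tokens, head in zip(level_tokens, self.attn_heads):
            logits = head(tokens).squeeze(-1)       # (B, N)
            masks.append(F.softmax(logits, dim=-1)) # per-level attention map
        fused = torch.stack(masks, dim=0).mean(0)   # (B, N) fused attention mask

        tokens = level_tokens[-1]                   # (B, N, D) final-layer tokens
        # Global descriptor: attention-weighted sum of patch tokens.
        global_desc = F.normalize((fused.unsqueeze(-1) * tokens).sum(dim=1), dim=-1)

        # Key-patch descriptors: keep the top-k tokens under the fused mask.
        idx = fused.topk(self.top_k, dim=-1).indices  # (B, top_k) patch indices
        key_patches = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return global_desc, key_patches, idx
```

In a retrieval pipeline of this shape, `global_desc` would drive nearest-neighbor search over the database, while `key_patches` together with the patch indices `idx` (which encode spatial positions) would support geometric verification, e.g. mutual-nearest-neighbor matching of key patches, to re-rank the retrieved candidates.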