UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

Visual Place Recognition (VPR) has traditionally been formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored, and existing methods often struggle to generalize across diverse environments. In this work, we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub at https://github.com/dtc111111/UniPR-3D.
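
As a rough illustration of the pipeline described above, the following minimal sketch fuses 2D and 3D tokens into one descriptor and retrieves the nearest place. It is not the UniPR-3D implementation: the abstract does not specify the aggregators, so generalized-mean (GeM) pooling for 2D tokens and mean pooling for 3D tokens are assumptions, and all function names (`gem_pool`, `mean_pool`, `build_descriptor`, `retrieve`) are hypothetical. Tokens are plain Python lists standing in for backbone outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def gem_pool(tokens, p=3.0, eps=1e-6):
    """Generalized-mean pooling over token vectors.
    Assumed stand-in for the 2D aggregator; values are clamped to eps first."""
    dim = len(tokens[0])
    pooled = []
    for d in range(dim):
        mean_pow = sum(max(t[d], eps) ** p for t in tokens) / len(tokens)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

def mean_pool(tokens):
    """Plain average over token vectors. Assumed stand-in for the 3D aggregator."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def build_descriptor(tokens_2d_per_frame, tokens_3d_per_frame):
    """Fuse per-frame 2D and 3D tokens into one fixed-size descriptor.
    Pooling runs over all frames, so any sequence length >= 1 is accepted."""
    all_2d = [t for frame in tokens_2d_per_frame for t in frame]
    all_3d = [t for frame in tokens_3d_per_frame for t in frame]
    desc_2d = l2_normalize(gem_pool(all_2d))
    desc_3d = l2_normalize(mean_pool(all_3d))
    # Concatenate the two branches and renormalize for cosine retrieval.
    return l2_normalize(desc_2d + desc_3d)

def retrieve(query_desc, database):
    """Return the index of the database descriptor with the highest cosine similarity."""
    scores = [sum(q * d for q, d in zip(query_desc, db)) for db in database]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data: place A is stored as a single frame, place B as three frames.
desc_a = build_descriptor([[[1.0, 0.1], [0.9, 0.2]]], [[[1.0, 0.0]]])
desc_b = build_descriptor(
    [[[0.1, 1.0]], [[0.2, 0.9]], [[0.1, 0.8]]],
    [[[0.0, 1.0]], [[0.0, 1.0]], [[0.1, 0.9]]],
)
# A two-frame query that resembles place A.
query = build_descriptor(
    [[[0.95, 0.15]], [[1.0, 0.1]]],
    [[[0.9, 0.1]], [[1.0, 0.0]]],
)
best_match = retrieve(query, [desc_a, desc_b])
```

Because pooling collapses the frame dimension, a single-frame entry and a multi-frame query yield descriptors of the same size and can be compared directly. This is one simple way to support variable-length sequences, and the strategy used in the paper may differ.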
