We propose a novel visual re-localization method based on direct matching between implicit 3D descriptors and the 2D image with a transformer. A conditional neural radiance field (NeRF) is chosen as the 3D scene representation in our pipeline, which supports continuous 3D descriptor generation and neural rendering. By unifying feature matching and scene coordinate regression into the same framework, our model learns generalizable knowledge and scene-specific priors during the two respective training stages. Furthermore, to improve localization robustness when a domain gap exists between the training and testing phases, we propose an appearance adaptation layer that explicitly aligns styles between the 3D model and the query image. Experiments show that our method achieves higher localization accuracy than other learning-based approaches on multiple benchmarks. Code is available at \url{https://github.com/JenningsL/nerf-loc}.
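Below is a minimal, illustrative sketch (not the authors' implementation) of the transformer-based 2D--3D matching the abstract describes: descriptors sampled from the conditional NeRF attend to 2D image features, and the resulting correspondences can feed a PnP pose solver. All names here (e.g. \texttt{CrossAttnMatcher}, \texttt{d\_model}) are hypothetical.

\begin{verbatim}
# A hedged sketch of transformer-based 2D-3D matching, assuming
# (a) 3D point descriptors queried from a conditional NeRF and
# (b) a 2D feature map from a query-image backbone are available.
import torch
import torch.nn as nn

class CrossAttnMatcher(nn.Module):
    """Cross-attention between N 3D descriptors and H*W 2D features."""
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, desc_3d, feat_2d):
        # desc_3d: (B, N, C) descriptors sampled from the NeRF
        # feat_2d: (B, H*W, C) flattened query-image features
        fused = self.decoder(tgt=desc_3d, memory=feat_2d)  # (B, N, C)
        # Similarity of each 3D point to every 2D location
        sim = torch.einsum('bnc,bmc->bnm', fused, feat_2d)
        return sim.softmax(dim=-1)  # matching probabilities (B, N, H*W)

# Usage: turn match probabilities into 2D-3D correspondences.
B, N, H, W, C = 1, 512, 60, 80, 256
matcher = CrossAttnMatcher(d_model=C)
desc_3d = torch.randn(B, N, C)      # NeRF points with known 3D coords
feat_2d = torch.randn(B, H * W, C)  # 2D backbone features (stand-in)
probs = matcher(desc_3d, feat_2d)   # (B, N, H*W)

# Expected 2D pixel of each 3D point under the match distribution;
# paired with the points' known 3D coordinates, these correspondences
# would be passed to a PnP + RANSAC solver for the camera pose.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W),
                        indexing='ij')
grid = torch.stack([xs, ys], -1).float().view(1, H * W, 2)
uv = probs @ grid                   # (B, N, 2)
\end{verbatim}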