Visual (re)localization addresses the problem of estimating the 6-DoF (Degree of Freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates with neural networks to build 2D-3D correspondences for camera pose optimization. However, such memorization requires training with large amounts of posed images in each scene, which is costly and inefficient. On the contrary, few-shot images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images. Our insight is to leverage a) a pre-learned feature extractor, b) a scene region classifier, and c) a meta-learning strategy to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes. Code is available at: \url{https://github.com/siyandong/SRC}
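To make the pipeline described above concrete, the sketch below illustrates the final pose-optimization step that structure-based methods share: once a network predicts a 3D scene coordinate for each query pixel, the resulting 2D-3D correspondences are passed to PnP with RANSAC to recover the 6-DoF camera pose. This is a minimal illustration using OpenCV, not the authors' implementation; the array names, intrinsics matrix, and threshold values are assumptions for the example.

```python
# Minimal sketch of pose recovery from predicted 2D-3D correspondences.
# Not the SRC implementation; variable names and thresholds are illustrative.
import cv2
import numpy as np

def estimate_pose(pts_2d: np.ndarray, pts_3d: np.ndarray, K: np.ndarray):
    """Recover the 6-DoF camera pose from 2D-3D correspondences.

    pts_2d: (N, 2) pixel coordinates in the query image.
    pts_3d: (N, 3) predicted scene coordinates for the same pixels.
    K:      (3, 3) camera intrinsics matrix.
    Returns (R, t): rotation matrix and translation vector.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64),
        pts_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        iterationsCount=1000,
        reprojectionError=8.0,  # inlier threshold in pixels (assumed value)
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        raise RuntimeError("PnP-RANSAC failed to find a valid pose")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to rotation matrix
    return R, tvec
```

In the full system, the quality of the pose depends on how the correspondences are obtained; the paper's contribution lies in producing them from few-shot images via scene region classification rather than dense per-scene coordinate regression.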