Visual localization is of great importance in robotics and computer vision. Recently, methods based on scene coordinate regression have shown good performance in visual localization in small static scenes. However, these methods still estimate camera poses from many inferior scene coordinates. To address this problem, we propose a novel visual localization framework that establishes 2D-to-3D correspondences between the query image and the 3D map through a set of learnable scene-specific landmarks. In the landmark generation stage, the 3D surfaces of the target scene are over-segmented into mosaic patches, whose centers are regarded as the scene-specific landmarks. To recover the scene-specific landmarks robustly and accurately, we propose the Voting with Segmentation Network (VS-Net), which segments the pixels into different landmark patches with a segmentation branch and estimates the landmark location within each patch with a landmark location voting branch. Since the number of landmarks in a scene may reach up to 5000, training a segmentation network with such a large number of classes using the common cross-entropy loss is costly in both computation and memory. We therefore propose a novel prototype-based triplet loss with hard negative mining, which can efficiently train semantic segmentation networks with a large number of labels. Our proposed VS-Net is extensively tested on multiple public benchmarks and can outperform state-of-the-art visual localization methods. Code and models are available at https://github.com/zju3dv/VS-Net.
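To make the idea of a prototype-based triplet loss with hard negative mining more concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the formulation used in VS-Net: the abstract does not specify the similarity measure, the margin, or how negatives are chosen, so the cosine similarity, the `margin` and `num_hard` parameters, and the function name `prototype_triplet_loss` are all hypothetical. Each pixel embedding is pulled toward the prototype of its own landmark class and pushed away from only the few most similar wrong prototypes.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prototype_triplet_loss(embeddings, labels, prototypes, margin=0.2, num_hard=2):
    """Mean hinge loss over (pixel, own prototype, hard negative prototype) triplets.

    embeddings: list of per-pixel embedding vectors
    labels:     list of landmark class indices, one per pixel
    prototypes: list of per-landmark prototype vectors
    margin, num_hard: hypothetical hyperparameters, not taken from the source
    """
    protos = [normalize(p) for p in prototypes]
    total, count = 0.0, 0
    for emb, label in zip(embeddings, labels):
        e = normalize(emb)
        # Cosine similarity to every prototype (assumed similarity measure).
        sims = [dot(e, p) for p in protos]
        pos = sims[label]
        # Hard negative mining: keep only the most similar wrong prototypes.
        negatives = sorted(
            (s for k, s in enumerate(sims) if k != label), reverse=True
        )[:num_hard]
        for neg in negatives:
            # Penalize negatives that come within `margin` of the positive.
            total += max(0.0, margin + neg - pos)
            count += 1
    return total / count if count else 0.0
```

In this sketch, only `num_hard` triplets per pixel contribute to the loss, regardless of how many landmark classes exist. For clarity it still scores every prototype in a Python loop; a practical version would use a tensor library and could restrict the candidate negatives further.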