Visual localization, i.e., the problem of camera pose estimation, is a central component of applications such as autonomous robots and augmented reality systems. A dominant approach in the literature, shown to scale to large scenes and to handle complex illumination and seasonal changes, is based on local features extracted from images. The scene representation is a sparse Structure-from-Motion point cloud that is tied to a specific local feature. Switching to another feature type requires an expensive feature matching step between the database images used to construct the point cloud. In this work, we thus explore a more flexible alternative based on dense 3D meshes that does not require feature matching between database images to build the scene representation. We show that this approach can achieve state-of-the-art results. We further show that surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage, and even when rendering raw scene geometry without color or texture. Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.
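To make the mesh-based pipeline described above concrete, the following is a minimal sketch (not the authors' released code) of localization against a rendered mesh: render a database view of the 3D mesh, match local features between the query image and the rendering, lift the matched rendering keypoints to 3D via the rendered depth buffer, and estimate the query pose with PnP + RANSAC. The library choices (trimesh, pyrender, OpenCV SIFT), file names, intrinsics, and the known database camera pose are all illustrative assumptions.

```python
import numpy as np
import cv2
import trimesh
import pyrender

# --- render a database view of the (textured or untextured) mesh ------------
fx = fy = 1200.0                     # assumed camera intrinsics
cx, cy = 640.0, 360.0
W, H = 1280, 720
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])

mesh = trimesh.load("scene_mesh.ply")                        # hypothetical mesh file
scene = pyrender.Scene(ambient_light=np.ones(3))
scene.add(pyrender.Mesh.from_trimesh(mesh))

T_db_gl = np.eye(4)                                          # assumed known database camera-to-world pose
scene.add(pyrender.IntrinsicsCamera(fx, fy, cx, cy), pose=T_db_gl)  # pyrender uses the OpenGL convention

renderer = pyrender.OffscreenRenderer(W, H)
color, depth = renderer.render(scene)                        # depth buffer in metric units

# --- match local features between the query image and the rendering ---------
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)        # hypothetical query image
render_gray = cv2.cvtColor(color, cv2.COLOR_RGB2GRAY)

sift = cv2.SIFT_create()
kq, dq = sift.detectAndCompute(query, None)
kr, dr = sift.detectAndCompute(render_gray, None)
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(dq, dr, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]   # Lowe ratio test

# --- lift matched rendering keypoints to 3D using the rendered depth --------
pts3d, pts2d = [], []
for m in good:
    u, v = kr[m.trainIdx].pt
    d = depth[int(round(v)), int(round(u))]
    if d <= 0:                                               # keypoint did not hit the surface
        continue
    # back-project in OpenCV camera coordinates, flip y and z to pyrender's
    # OpenGL camera frame, then map to world coordinates via the database pose
    p_cv = np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])
    p_gl = np.array([p_cv[0], -p_cv[1], -p_cv[2], 1.0])
    pts3d.append((T_db_gl @ p_gl)[:3])
    pts2d.append(kq[m.queryIdx].pt)

# --- estimate the query camera pose with PnP + RANSAC -----------------------
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    np.float32(pts3d), np.float32(pts2d), K, None, reprojectionError=4.0)
print("pose found:", ok, "inliers:", 0 if inliers is None else len(inliers))
```

Because the 2D-3D correspondences come from the rendered depth rather than a Structure-from-Motion point cloud, the feature extractor in this sketch can be swapped for any other local feature without re-matching the database images.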