Most existing approaches for visual localization either require a detailed 3D model of the environment or, in the case of learning-based methods, must be retrained for each new scene. This can be either very expensive or simply impossible for large, unknown environments, for example in search-and-rescue scenarios. Although scene-agnostic learning-based approaches exist, their generalization capability is still outperformed by classical methods. In this paper, we present an approach that generalizes to new scenes through specific changes to the model architecture, including an extended regression part, the use of hierarchical correlation layers, and the exploitation of scale and uncertainty information. Our approach outperforms the 5-point algorithm using SIFT features on images of the same size and additionally surpasses all previous learning-based approaches that were trained on different data. It is also superior to most approaches that were trained specifically on the respective scenes. We further evaluate our approach in a scenario where only very few reference images are available, showing that under such more realistic conditions our learning-based approach considerably exceeds both existing learning-based and classical methods.