Visual localization is a core component of many applications, including augmented reality (AR). Localization algorithms compute the camera pose of a query image w.r.t. a scene representation, which is typically built from images. This often requires capturing and storing large amounts of data, followed by running Structure-from-Motion (SfM) algorithms. An interesting, and underexplored, source of data for building scene representations is 3D models that are readily available on the Internet, e.g., hand-drawn CAD models, or 3D models generated from building footprints or from aerial images. Such models make it possible to perform visual localization right away, without the time-consuming scene capturing and model building steps. Yet, using them also comes with challenges, as the available 3D models are often imperfect reflections of reality: they might have only generic textures or no textures at all, might provide only a coarse approximation of the scene geometry, or might be stretched. This paper studies how the imperfections of these models affect localization accuracy. We create a new benchmark for this task and provide a detailed experimental evaluation based on multiple 3D models per scene. We show that 3D models from the Internet show promise as an easy-to-obtain scene representation. At the same time, there is significant room for improvement for visual localization pipelines. To foster research on this interesting and challenging task, we release our benchmark at v-pnk.github.io/cadloc.
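To make the pose estimation step concrete: given 2D-3D correspondences between a query image and a scene model, the camera pose is commonly recovered with a PnP solver inside a RANSAC loop. The sketch below is not the paper's pipeline but a minimal, hedged illustration of this standard step using OpenCV; the correspondence arrays and the threshold values are hypothetical placeholders, and a real pipeline would obtain the matches by feature matching against the (possibly untextured or geometrically coarse) 3D model.

```python
# Minimal sketch of pose estimation from 2D-3D matches (PnP + RANSAC).
# Assumptions: numpy and opencv-python are installed; points_2d/points_3d
# are hypothetical, already-matched correspondence arrays.
import numpy as np
import cv2

def localize(points_2d: np.ndarray,   # (N, 2) keypoints in the query image
             points_3d: np.ndarray,   # (N, 3) matched scene-model coordinates
             K: np.ndarray):          # (3, 3) intrinsics of the query camera
    """Estimate the world-to-camera pose (R, t) from 2D-3D matches."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K, distCoeffs=None,
        reprojectionError=8.0,         # pixel threshold (assumed value);
        iterationsCount=10000)         #   imperfect models may need a looser one
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)         # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```

One reason this step is sensitive to the model imperfections studied here: a stretched or coarsely approximated 3D model perturbs the 3D points, so even correct matches incur reprojection error, which pushes up the threshold needed for RANSAC to retain inliers.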