We devise a graph attention network-based approach that learns a scene triangle mesh representation in order to estimate the camera pose of an image in a dynamic environment. Previous approaches build a scene-dependent model that explicitly or implicitly embeds the structure of the scene, using convolutional neural networks or decision trees to establish 2D/3D-3D correspondences. Such a mapping overfits the target scene and does not generalize well to dynamic changes in the environment. Our work introduces a novel approach that solves the camera relocalization problem using the available triangle mesh. Our 3D-3D matching framework consists of three blocks: (1) a graph neural network that computes embeddings of mesh vertices, (2) a convolutional neural network that computes embeddings of grid cells defined on the RGB-D image, and (3) a neural network model that establishes correspondences between the two sets of embeddings. These three components are trained end-to-end. To predict the final pose, we run the RANSAC algorithm to generate camera pose hypotheses, and we refine the prediction using the point-cloud representation. Our approach significantly improves camera pose accuracy over the state-of-the-art method, raising the score from $0.358$ to $0.506$ on the RIO10 benchmark for dynamic indoor camera relocalization.
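The final pose-estimation step described above (RANSAC over 3D-3D correspondences, followed by refinement on the consensus set) can be sketched as below. This is a minimal generic illustration, not the authors' implementation: the function names (`kabsch`, `ransac_pose`), the inlier threshold, and the iteration count are all assumptions, and the rigid transform is estimated with the standard Kabsch/SVD method.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) with Q ~= P @ R.T + t (Kabsch/SVD)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def ransac_pose(P, Q, iters=200, thresh=0.05, rng=None):
    """Hypothesize-and-verify pose from putative 3D-3D matches (P_i <-> Q_i)."""
    rng = np.random.default_rng(rng)
    n, best_R, best_t, best_inliers = len(P), None, None, -1
    for _ in range(iters):
        idx = rng.choice(n, 3, replace=False)      # minimal sample: 3 matches
        R, t = kabsch(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_R, best_t = inliers, R, t
    # refine the winning hypothesis on its full consensus set
    err = np.linalg.norm(P @ best_R.T + best_t - Q, axis=1)
    mask = err < thresh
    R, t = kabsch(P[mask], Q[mask])
    return R, t, mask
```

In the paper's pipeline the correspondences would come from the learned mesh-vertex and grid-cell embeddings, and the refinement would use the full point-cloud representation rather than only the sampled matches.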