Camera pose estimation, or camera relocalization, is central to numerous computer vision tasks such as visual odometry, structure from motion (SfM) and SLAM. In this paper we propose TransCamP, a neural network approach with a graph transformer backbone, to address the camera relocalization problem. In contrast with prior work, where pose regression is mainly guided by photometric consistency, TransCamP fuses image features, camera pose information and inter-frame relative camera motions into encoded graph attributes and is instead trained towards graph consistency and accuracy, yielding significantly higher computational efficiency. By leveraging graph transformer layers with edge features and a tensorized adjacency matrix, TransCamP dynamically captures global attention and thus endows the pose graph with an evolving structure, improving robustness and accuracy. In addition, optional temporal transformer layers enhance the spatiotemporal inter-frame relations for sequential inputs. Evaluation of the proposed network on various public benchmarks demonstrates that TransCamP outperforms state-of-the-art approaches.
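To make the idea of attention over a pose graph with edge features more concrete, the following is a minimal NumPy sketch of a single-head, edge-aware attention step. It is an illustration under stated assumptions, not the TransCamP architecture: the function name `edge_aware_attention`, the additive use of edge features in the key, the random weights, and all dimensions are hypothetical choices. Nodes stand for frames, edge features stand for encoded relative camera motions, and a boolean adjacency mask restricts attention to connected frames.

```python
import numpy as np

def edge_aware_attention(node_feats, edge_feats, adjacency, w_q, w_k, w_v, w_e):
    """Single-head attention over a graph whose edges carry features.

    node_feats: (N, D) per-frame features (hypothetical image/pose encoding)
    edge_feats: (N, N, D_e) per-edge features (hypothetical relative-motion encoding)
    adjacency:  (N, N) boolean mask, True where an edge exists; each row
                must contain at least one True entry (e.g. a self-loop)
    """
    q = node_feats @ w_q   # (N, H) queries
    k = node_feats @ w_k   # (N, H) keys
    v = node_feats @ w_v   # (N, H) values
    e = edge_feats @ w_e   # (N, N, H) projected edge features

    # Assumed scoring rule: score[i, j] = <q_i, k_j + e_ij> / sqrt(H)
    scores = np.einsum("ih,ijh->ij", q, k[None, :, :] + e) / np.sqrt(q.shape[1])

    # Mask out non-edges, then apply a numerically stable row-wise softmax.
    scores = np.where(adjacency, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ v, weights

# Toy example: 4 frames in a chain, each linked to itself and its neighbours.
rng = np.random.default_rng(0)
n, d, d_e, h = 4, 8, 6, 5
node_feats = rng.normal(size=(n, d))
edge_feats = rng.normal(size=(n, n, d_e))
idx = np.arange(n)
adjacency = np.abs(idx[:, None] - idx[None, :]) <= 1

out, weights = edge_aware_attention(
    node_feats, edge_feats, adjacency,
    rng.normal(size=(d, h)), rng.normal(size=(d, h)),
    rng.normal(size=(d, h)), rng.normal(size=(d_e, h)),
)
```

In this sketch the attention weights are recomputed from the current node and edge features on every call, which is one simple way to read the statement that the graph structure can evolve with the data; the actual mechanism used by TransCamP may differ.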