We address the problem of camera pose estimation in visual localization. Current regression-based methods for pose estimation are trained and evaluated scene-wise; they depend on the coordinate frame of the training dataset and generalize poorly across scenes and datasets. We identify dataset shift as an important barrier to generalization and consider transfer learning as an alternative route towards better reuse of pose estimation models. We revisit domain adaptation techniques for classification and extend them to camera pose estimation, which is a multi-regression task. We develop a deep adaptation network for learning scene-invariant image representations and use adversarial learning to generate such representations for model transfer. We enrich the network with self-supervised learning and use adaptability theory to validate the existence of scene-invariant representations of images in two given scenes. We evaluate our network on two public datasets, Cambridge Landmarks and 7Scenes, demonstrate its superiority over several baselines, and compare it to state-of-the-art methods.
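To make the adversarial component concrete, the sketch below shows one common way to couple a pose regressor with a scene discriminator through a gradient reversal layer, in the style of DANN (Ganin & Lempitsky, 2015). This is a minimal PyTorch sketch under our own assumptions, not the paper's implementation: the names `PoseAdaptNet` and `GradReverse`, the ResNet-50 backbone, and the 7-DoF pose parameterization are illustrative choices only.

```python
import torch
import torch.nn as nn
from torchvision import models

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, flips the
    gradient sign on the backward pass so the shared encoder is pushed
    towards features the scene discriminator cannot separate."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed and scaled; lam gets no gradient.
        return -ctx.lam * grad_output, None

class PoseAdaptNet(nn.Module):
    """Hypothetical sketch: shared encoder + pose regressor
    (3-D translation + 4-D quaternion, i.e. a multi-regression head)
    + adversarial scene discriminator."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Drop the classification head; keep conv stages + global pooling.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.pose_head = nn.Linear(feat_dim, 7)    # t (3) + q (4)
        self.scene_disc = nn.Sequential(           # source vs. target scene
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, x, lam=1.0):
        f = self.encoder(x).flatten(1)             # (B, feat_dim)
        pose = self.pose_head(f)
        domain = self.scene_disc(GradReverse.apply(f, lam))
        return pose, domain
```

In this formulation, a supervised pose loss (e.g. L1 on translation and quaternion) is applied to labelled source-scene images only, while a cross-entropy domain loss is applied to images from both scenes; the reversed gradient drives the encoder towards scene-invariant representations, which is the property the abstract's adversarial transfer relies on.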