Visual localization is one of the most important components of robotics and autonomous driving. Recently, CNN-based methods have shown inspiring results by providing a direct formulation that regresses the 6-DoF absolute pose end-to-end. Additional information, such as geometric or semantic constraints, is generally introduced to improve performance. In particular, the latter can aggregate high-level semantic information into the localization task, but it usually requires extensive manual annotation. To this end, we propose a novel auxiliary learning strategy for camera localization that introduces scene-specific high-level semantics from a self-supervised representation learning task. Image colorization, a powerful proxy task that outputs a pixel-wise color version of a grayscale photograph without extra annotations, is chosen as the complementary task. In our work, feature representations from the colorization network are embedded into the localization network by design to produce discriminative features for pose regression. Meanwhile, an attention mechanism is introduced to further benefit localization performance. Extensive experiments show that our model significantly improves localization accuracy over the state of the art on both indoor and outdoor datasets.
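To make the auxiliary-learning setup concrete, the sketch below illustrates one way such a design could be wired up: a colorization encoder (the self-supervised proxy branch) produces semantic-aware features that are fused, through a simple channel-wise attention gate, with the localization branch before regressing a 3-D translation and a quaternion rotation. This is a minimal illustrative sketch, not the authors' released code; all module names, channel sizes, and the exact fusion scheme are assumptions.

```python
# Minimal sketch of the auxiliary-learning idea described above. Not the
# authors' implementation: module names, channel sizes, and the attention-based
# fusion are illustrative assumptions.
import torch
import torch.nn as nn


class ColorizationEncoder(nn.Module):
    """Encoder trained on the proxy task: predict color from a grayscale input."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, gray):
        return self.backbone(gray)  # semantic-aware feature map


class PoseNetWithColorizationAux(nn.Module):
    """Localization branch that attends to colorization features before pose regression."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.loc_backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Channel-wise attention gate that re-weights the fused features.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * feat_dim, 2 * feat_dim, 1), nn.Sigmoid(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_trans = nn.Linear(2 * feat_dim, 3)  # translation x, y, z
        self.fc_rot = nn.Linear(2 * feat_dim, 4)    # rotation as a quaternion

    def forward(self, rgb, color_feat):
        loc_feat = self.loc_backbone(rgb)
        fused = torch.cat([loc_feat, color_feat], dim=1)
        fused = fused * self.attention(fused)       # attention-weighted fusion
        v = self.pool(fused).flatten(1)
        return self.fc_trans(v), self.fc_rot(v)


if __name__ == "__main__":
    rgb = torch.randn(2, 3, 256, 256)
    gray = rgb.mean(dim=1, keepdim=True)            # grayscale input for the proxy branch
    color_feat = ColorizationEncoder()(gray)
    t, q = PoseNetWithColorizationAux()(rgb, color_feat)
    print(t.shape, q.shape)                         # torch.Size([2, 3]) torch.Size([2, 4])
```

In this kind of design the colorization encoder can be pre-trained (or co-trained) on the proxy task with no extra labels, and its features supply scene-specific semantic cues that the attention gate lets the pose head exploit selectively.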