Identifying the camera pose for a given image is a challenging problem with applications in robotics, autonomous vehicles, and augmented/virtual reality. Recently, learning-based methods have been shown to be effective for absolute camera pose estimation. However, these methods are not accurate when generalizing to different domains. In this paper, a domain-adaptive training framework for absolute pose regression is introduced. In the proposed framework, the scene image is augmented for different domains using generative methods, and parallel branches are trained with the Barlow Twins objective. The parallel branches leverage a lightweight CNN-based absolute pose regressor architecture. Further, the efficacy of incorporating spatial and channel-wise attention in the regression head for rotation prediction is investigated. Our method is evaluated on two datasets, Cambridge Landmarks and 7Scenes. The results demonstrate that, even while using roughly 24 times fewer FLOPs, 12 times fewer activations, and 5 times fewer parameters than MS-Transformer, our approach outperforms all the CNN-based architectures and achieves performance comparable to transformer-based architectures. Our method ranks 2nd and 4th on the Cambridge Landmarks and 7Scenes datasets, respectively. In addition, for augmented domains not encountered during training, our approach significantly outperforms MS-Transformer. Furthermore, it is shown that our domain-adaptive framework achieves better performance than a single-branch model trained with the identical CNN backbone on all instances of the unseen distribution.
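As a rough illustration of the parallel-branch training idea summarized above, the sketch below applies a Barlow Twins redundancy-reduction loss to embeddings produced by two branches, one fed the original scene image and one fed a generatively augmented view. The function name, tensor shapes, and the lambda weight are illustrative assumptions, not the paper's released code; in practice this term would presumably be combined with the position and rotation regression losses of each branch.

```python
# Minimal sketch (assumed, not the authors' implementation): Barlow Twins loss
# between embeddings of two parallel pose-regression branches.
import torch


def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                      lambd: float = 5e-3) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings from the original and augmented branches."""
    n, _ = z_a.shape
    # Standardize each feature dimension over the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Cross-correlation matrix between the two branches' features.
    c = (z_a.T @ z_b) / n  # shape: (dim, dim)
    # Push diagonal entries toward 1 (invariance to the augmentation) and
    # off-diagonal entries toward 0 (redundancy reduction).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag
```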