Image-based localization is an important problem in computer vision because of its wide applicability in robotics, augmented reality, and autonomous systems. The literature describes a rich set of methods for geometrically registering a 2D image with respect to a 3D model. Recently, methods based on deep (and convolutional) feedforward networks (CNNs) have become popular for pose regression. However, although these CNN-based methods are fast and memory efficient, they are still less accurate than geometry-based methods. In this work we design a deep neural network architecture based on sparse feature descriptors to estimate the absolute pose of an image. Using sparse feature descriptors has two major advantages. First, our network is significantly smaller than the CNNs proposed in the literature for this task, which makes our approach more efficient and scalable. Second, and more importantly, sparse features allow us to augment the training data with synthetic viewpoints, which leads to substantial improvements in generalization to unseen poses. Our proposed method thus aims to combine the best of two worlds, feature-based localization and CNN-based pose regression, to achieve state-of-the-art performance in absolute pose estimation. We provide a detailed analysis of the proposed architecture and a rigorous evaluation on existing datasets to support our method.
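To make the synthetic-viewpoint idea concrete, the following is a minimal sketch of one way such augmentation could work. It is not the authors' implementation, and the abstract does not specify these details. It assumes that each sparse descriptor is attached to a known 3D point, that a pinhole camera model applies, and that a synthetic training sample can be formed by projecting those points into a hypothetical camera pose. The function name `project_to_synthetic_view`, the intrinsics `K`, and the random data are all illustrative assumptions.

```python
import numpy as np


def project_to_synthetic_view(points_w, descriptors, R, t, K, width, height):
    """Project 3D points into a hypothetical camera and keep the visible ones.

    points_w:    (N, 3) points in the world frame
    descriptors: (N, D) one descriptor per 3D point
    R, t:        world-to-camera rotation (3, 3) and translation (3,)
    K:           (3, 3) pinhole intrinsics (assumed, not from the paper)
    Returns the 2D keypoints and the descriptors of the visible points.
    """
    # World frame -> camera frame: X_c = R @ X_w + t
    points_c = points_w @ R.T + t

    # Discard points behind (or numerically on) the camera plane.
    in_front = points_c[:, 2] > 1e-6
    points_c = points_c[in_front]

    # Pinhole projection followed by perspective division.
    uvw = points_c @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Keep only projections that fall inside the image bounds.
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv[inside], descriptors[in_front][inside]


# Illustrative data: random 3D points with random 128-D descriptors.
rng = np.random.default_rng(0)
points_w = rng.uniform([-2.0, -2.0, 2.0], [2.0, 2.0, 6.0], size=(500, 3))
descriptors = rng.standard_normal((500, 128)).astype(np.float32)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# A hypothetical synthetic pose: identity rotation, small sideways shift.
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

keypoints, kept_desc = project_to_synthetic_view(
    points_w, descriptors, R, t, K, width=640, height=480
)
```

Under these assumptions, each pair `(keypoints, kept_desc)` together with its pose `(R, t)` would form one synthetic training sample. No image rendering is needed, which is what makes this kind of augmentation cheap for a descriptor-based network.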