Visual-inertial localization is a key problem in computer vision and robotics applications such as virtual reality, self-driving cars, and aerial vehicles. The goal is to estimate an accurate pose of an object when either the environment or the dynamics are known. Recent methods directly regress the pose using convolutional and spatio-temporal networks. Absolute pose regression (APR) techniques predict the absolute camera pose from an image input in a known scene. Odometry methods perform relative pose regression (RPR), predicting the relative pose from known object dynamics (visual or inertial inputs). The localization task can be improved by combining both data sources in a cross-modal setup, which is challenging because the two tasks impose contradictory objectives. In this work, we conduct a benchmark to evaluate deep multimodal fusion based on pose graph optimization (PGO) and attention networks. Auxiliary and Bayesian learning are integrated for the APR task. We show accuracy improvements for the RPR-aided APR task and for the RPR-RPR task for aerial vehicles and hand-held devices. We conduct experiments on the EuRoC MAV and PennCOSYVIO datasets, and record a novel industry dataset.
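The attention-based fusion mentioned above can be illustrated with a minimal sketch: a learned query scores each modality's feature vector, the scores are softmax-normalized, and the fused feature is the weighted sum. All names here (`attention_fuse`, `w_query`) are illustrative assumptions, not identifiers from the benchmark itself.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(apr_feat, rpr_feat, w_query):
    """Fuse APR and RPR feature vectors with soft attention weights.

    apr_feat, rpr_feat: feature embeddings of the two modalities (same dim).
    w_query: a learned query vector (hypothetical; stands in for the
             attention parameters a real network would train).
    """
    # score each modality against the query, then normalize to weights
    scores = np.array([apr_feat @ w_query, rpr_feat @ w_query])
    weights = softmax(scores)
    # convex combination of the two modality features
    fused = weights[0] * apr_feat + weights[1] * rpr_feat
    return fused, weights
```

In a trained model the weights would adapt per sample, letting the network lean on the absolute (APR) branch in well-mapped scenes and on the relative (RPR) branch when the scene is ambiguous.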