Monocular visual-inertial odometry (VIO) is a critical problem in robotics and autonomous driving. Traditional methods solve it with filtering or optimization. Although fully interpretable, they rely on manual intervention and empirical parameter tuning. Learning-based approaches, on the other hand, allow end-to-end training but require a large amount of training data to learn millions of parameters, and their heavy, non-interpretable models hinder generalization. In this paper, we propose a fully differentiable, interpretable, and lightweight monocular VIO model that contains only 4 trainable parameters. Specifically, we first adopt an Unscented Kalman Filter (UKF) as a differentiable layer to predict the pitch and roll, where the noise covariance matrices are learned to filter out the noise in the raw IMU data. Second, the refined pitch and roll are used to retrieve a gravity-aligned bird's-eye-view (BEV) image of each frame via differentiable camera projection. Finally, a differentiable pose estimator estimates the remaining 4-DoF pose between the BEV frames. Our method learns the covariance matrices end-to-end, supervised by the pose estimation loss, and outperforms empirical baselines. Experimental results on synthetic and real-world datasets demonstrate that our simple approach is competitive with state-of-the-art methods and generalizes well to unseen scenes.
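To make the gravity-alignment step concrete, the following is a minimal NumPy sketch, not the paper's UKF layer. It assumes a static accelerometer reading, a world frame with the z-axis pointing up, and a ZYX Euler convention (R = Rz(yaw) Ry(pitch) Rx(roll)); none of these conventions is stated in the abstract, and the function names are hypothetical. It recovers roll and pitch from the measured gravity direction and builds the rotation that removes them, which leaves only yaw and the 3D translation, i.e. 4 DoF, to be estimated.

```python
import numpy as np

def rot_x(a):
    """Rotation about the x-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    """Rotation about the y-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    """Rotation about the z-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def roll_pitch_from_accel(f_body):
    """Roll and pitch from a static specific-force reading in the body frame.

    Assumption (not from the paper): the sensor is at rest, so the reading is
    R^T @ [0, 0, g] with R = Rz(yaw) @ Ry(pitch) @ Rx(roll) and world z up.
    """
    fx, fy, fz = f_body
    roll = np.arctan2(fy, fz)
    pitch = np.arctan2(-fx, np.hypot(fy, fz))
    return roll, pitch

def gravity_alignment(roll, pitch):
    """Rotation taking body-frame vectors into a gravity-aligned frame.

    Yaw is deliberately left out: it is one of the remaining 4 DoF.
    """
    return rot_y(pitch) @ rot_x(roll)

# Usage: simulate a tilted sensor, then recover and undo the tilt.
g_world = np.array([0.0, 0.0, 9.81])
R_true = rot_z(0.7) @ rot_y(-0.2) @ rot_x(0.1)
f_body = R_true.T @ g_world
roll, pitch = roll_pitch_from_accel(f_body)
f_aligned = gravity_alignment(roll, pitch) @ f_body  # points along world z
```

In the method described above, the roll and pitch would come from the UKF with learned noise covariances rather than from a single accelerometer sample, and the alignment rotation would be applied through a differentiable camera projection to produce the BEV image.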