Monocular visual-inertial odometry (VIO) is a critical problem in robotics and autonomous driving. Traditional methods solve it by filtering or optimization. While fully interpretable, they rely on manual intervention and empirical parameter tuning. Learning-based approaches, on the other hand, allow end-to-end training but require large amounts of training data to learn millions of parameters, and such heavy, non-interpretable models hinder generalization. In this paper, we propose a fully differentiable and interpretable bird's-eye-view (BEV) based VIO model for robots with local planar motion that can be trained without deep neural networks. Specifically, we first adopt an Unscented Kalman Filter (UKF) as a differentiable layer to predict pitch and roll, where the noise covariance matrices are learned in order to filter out the noise in the raw IMU data. Second, the refined pitch and roll are used to retrieve a gravity-aligned BEV image of each frame through differentiable camera projection. Finally, a differentiable pose estimator estimates the remaining 3-DoF pose between the BEV frames, leading to a 5-DoF pose estimate. Our method learns the covariance matrices end-to-end, supervised by the pose estimation loss, and demonstrates superior performance to empirically tuned baselines. Experimental results on synthetic and real-world datasets show that our simple approach is competitive with state-of-the-art methods and generalizes well to unseen scenes.
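To make the second step concrete, the following is a minimal sketch of how pitch and roll can be used to compensate camera tilt with a pure-rotation homography, which is one common building block of a gravity-aligned projection. It is not the paper's implementation: the function names, the axis convention (x right, y down, z forward; pitch about x, roll about z), the rotation order, and the intrinsics `K` are all assumptions made for illustration. A full BEV projection would additionally need the camera height and a ground-plane model, which are omitted here.

```python
import numpy as np


def rot_x(angle):
    """Rotation about the camera x-axis (assumed pitch axis)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_z(angle):
    """Rotation about the camera z-axis (assumed roll axis, the optical axis)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def tilt_rotation(pitch, roll):
    """Assumed convention: maps ray directions from the tilted camera frame
    to a gravity-aligned camera frame. The order of composition is a choice."""
    return rot_x(pitch) @ rot_z(roll)


def gravity_alignment_homography(K, pitch, roll):
    """Pure-rotation homography H = K R K^{-1} taking pixels of the tilted
    image to pixels of the gravity-aligned image."""
    return K @ tilt_rotation(pitch, roll) @ np.linalg.inv(K)


def warp_pixel(H, uv):
    """Apply a homography to one pixel (u, v) and dehomogenize."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]


# Hypothetical pinhole intrinsics, for illustration only.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

H = gravity_alignment_homography(K, pitch=0.05, roll=-0.02)
corrected = warp_pixel(H, (100.0, 200.0))
```

Because every operation above is a smooth function of `pitch` and `roll`, the same construction can be written in an automatic-differentiation framework so that gradients of a downstream pose loss flow back to the filter that produced the angles.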