We describe a method to infer dense depth from camera motion and sparse depth as estimated using a visual-inertial odometry system. Unlike other scenarios using point clouds from lidar or structured light sensors, we have only a few hundred to a few thousand points, insufficient to inform the topology of the scene. Our method first constructs a piecewise planar scaffolding of the scene, and then uses it to infer dense depth using the image along with the sparse points. We use a predictive cross-modal criterion, akin to `self-supervision,' measuring photometric consistency across time, forward-backward pose consistency, and geometric compatibility with the sparse point cloud. We also release the first visual-inertial + depth dataset, which we hope will foster additional exploration into combining the complementary strengths of visual and inertial sensors. To compare our method to prior work, we adopt the unsupervised KITTI depth completion benchmark, and show state-of-the-art performance on it. Code available at: https://github.com/alexklwong/unsupervised-depth-completion-visual-inertial-odometry.
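The abstract names three terms of the predictive cross-modal criterion: photometric consistency across time, forward-backward pose consistency, and geometric compatibility with the sparse points. The following is a minimal numpy sketch of how such a composite objective could be assembled; the function names, weights, and exact penalty forms are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def photometric_consistency(i_target, i_reconstructed):
    # Mean absolute photometric error between the target frame and the
    # frame reconstructed by warping a temporally adjacent image using
    # the predicted depth and relative camera pose.
    return np.mean(np.abs(i_target - i_reconstructed))

def pose_consistency(T_forward, T_backward):
    # Forward-backward pose consistency: composing the forward and
    # backward relative poses should approximate the identity transform.
    return np.mean(np.abs(T_forward @ T_backward - np.eye(4)))

def sparse_depth_consistency(depth_pred, depth_sparse, mask):
    # Geometric compatibility: predicted depth should agree with the
    # sparse points (e.g. from visual-inertial odometry) where available;
    # mask is 1 at pixels with a valid sparse measurement, 0 elsewhere.
    return np.sum(mask * np.abs(depth_pred - depth_sparse)) / max(np.sum(mask), 1.0)

def total_loss(i_target, i_reconstructed, T_forward, T_backward,
               depth_pred, depth_sparse, mask,
               w_photo=1.0, w_pose=0.1, w_sparse=0.5):
    # Weighted sum of the three terms; the weights here are placeholders.
    return (w_photo * photometric_consistency(i_target, i_reconstructed)
            + w_pose * pose_consistency(T_forward, T_backward)
            + w_sparse * sparse_depth_consistency(depth_pred, depth_sparse, mask))
```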