In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec builds on a multi-view stereo setting that encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving-object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects using the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we further demonstrate that MonoRec generalizes well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded with a handheld camera. Code and related materials will be available at https://vision.in.tum.de/research/monorec.
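Since the abstract centers on encoding multiple consecutive frames into a photometric cost volume, a minimal sketch of the generic plane-sweep construction may help make that idea concrete. This is not MonoRec's implementation: the function names (`warp_to_reference`, `build_cost_volume`), the SAD photometric error, nearest-neighbour sampling, and the assumption of known intrinsics K and relative poses (R, t) are all simplifications for illustration.

```python
# Generic plane-sweep cost-volume sketch (NumPy), illustrating the multi-view
# stereo idea the abstract refers to. NOT the MonoRec implementation; images
# are assumed grayscale float arrays, poses and intrinsics are assumed known.
import numpy as np

def warp_to_reference(src_img, K, R, t, depth):
    """Inverse-warp a source image into the reference view, assuming every
    reference pixel lies at the given constant depth (one plane hypothesis)."""
    h, w = src_img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N
    # Back-project reference pixels to 3D at the hypothesized depth.
    pts = np.linalg.inv(K) @ pix * depth
    # Transform into the source camera frame and project to its image plane.
    proj = K @ (R @ pts + t[:, None])
    u = (proj[0] / proj[2]).reshape(h, w)
    v = (proj[1] / proj[2]).reshape(h, w)
    # Nearest-neighbour sampling for brevity (bilinear in practice).
    ui = np.clip(np.round(u).astype(int), 0, w - 1)
    vi = np.clip(np.round(v).astype(int), 0, h - 1)
    return src_img[vi, ui]

def build_cost_volume(ref_img, src_imgs, poses, K, depths):
    """Stack per-depth photometric errors into a (len(depths), H, W) volume;
    a low cost marks a depth hypothesis that is photo-consistent across views."""
    h, w = ref_img.shape
    volume = np.zeros((len(depths), h, w), dtype=np.float32)
    for d_idx, depth in enumerate(depths):
        errs = [np.abs(warp_to_reference(src, K, R, t, depth) - ref_img)
                for src, (R, t) in zip(src_imgs, poses)]
        volume[d_idx] = np.mean(errs, axis=0)  # average SAD over source views
    return volume

# Depth hypotheses are often sampled uniformly in inverse depth, e.g.:
# depths = 1.0 / np.linspace(1 / 80.0, 1 / 2.0, 32)
```

In such a volume, pixels on moving objects remain photometrically inconsistent at every depth hypothesis; MonoRec's MaskModule exploits exactly this signal to predict moving-object masks. The actual MonoRec cost volume differs in its details (error measure, weighting, and depth sampling); see the paper and code at the URL above.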