In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on an MVS setting that encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving-object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other MVS methods, MonoRec is able to predict accurate depths for both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we further demonstrate that MonoRec generalizes well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded with a handheld camera. The training code and pre-trained model will be published soon.
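To make the cost-volume idea concrete, below is a minimal plane-sweep sketch: a reference frame is compared against source frames warped to a set of depth hypotheses, and the photometric error at each hypothesis fills one slice of the volume. This is a generic illustration, not MonoRec's exact formulation (the paper uses an SSIM-based photometric measure and additional weighting); all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(ref, srcs, K, K_inv, poses, depth_hyps):
    """Photometric plane-sweep cost volume (minimal sketch).

    ref:        reference image, shape (1, 3, H, W)
    srcs:       list of source images, each (1, 3, H, W)
    K, K_inv:   camera intrinsics and their inverse, shape (3, 3)
    poses:      list of relative poses T_src<-ref, each (4, 4)
    depth_hyps: 1-D tensor of D depth hypotheses
    Returns a cost volume of shape (1, D, H, W); lower cost means
    the views agree photometrically at that depth.
    """
    _, _, H, W = ref.shape
    device = ref.device
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs.reshape(-1).float(), ys.reshape(-1).float(),
                       torch.ones(H * W, device=device)])
    rays = K_inv @ pix  # back-projected viewing rays, (3, H*W)

    slices = []
    for d in depth_hyps:
        per_src = []
        for src, T in zip(srcs, poses):
            # Lift reference pixels to depth d, transform into the
            # source frame, and project back to source pixel coords.
            pts = T[:3, :3] @ (rays * d) + T[:3, 3:4]
            proj = K @ pts
            uv = proj[:2] / proj[2:].clamp(min=1e-6)
            # Normalize to [-1, 1] for grid_sample.
            u = 2 * uv[0] / (W - 1) - 1
            v = 2 * uv[1] / (H - 1) - 1
            grid = torch.stack([u, v], dim=-1).view(1, H, W, 2)
            warped = F.grid_sample(src, grid, align_corners=True)
            # Simple L1 photometric error per pixel.
            per_src.append((warped - ref).abs().mean(1, keepdim=True))
        # Average the error over all source views for this depth slice.
        slices.append(torch.stack(per_src).mean(0))
    return torch.cat(slices, dim=1)  # (1, D, H, W)
```

In this picture, the role of the MaskModule is intuitive: pixels on moving objects violate the static-scene assumption behind the warp, so no depth hypothesis yields a consistently low cost, and that inconsistency pattern is the signal from which the moving-object masks are predicted.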