Depth information is essential for on-board perception in autonomous driving and driver assistance. Monocular depth estimation (MDE) is very appealing since it establishes a direct pixelwise correspondence between appearance and depth without further calibration. The best MDE models are based on Convolutional Neural Networks (CNNs) trained in a supervised manner, i.e., assuming pixelwise ground truth (GT). Usually, this GT is acquired at training time through a calibrated multi-modal sensor suite. However, using only a monocular system at training time as well is cheaper and more scalable. This is possible by relying on structure-from-motion (SfM) principles to generate self-supervision. Nevertheless, problems such as camouflaged objects, visibility changes, static-camera intervals, textureless areas, and scale ambiguity diminish the usefulness of such self-supervision. In this paper, we perform monocular depth estimation by virtual-world supervision (MonoDEVS) and real-world SfM self-supervision. We compensate for the limitations of SfM self-supervision by leveraging virtual-world images with accurate semantic and depth supervision, and by addressing the virtual-to-real domain gap. Our MonoDEVSNet outperforms previous MDE CNNs trained on monocular and even stereo sequences.