Current self-supervised methods for monocular depth estimation are largely based on deeply nested convolutional networks that leverage stereo image pairs or monocular sequences during training. However, they often produce inaccurate results around occluded regions and depth boundaries. In this paper, we present a simple yet effective approach to monocular depth estimation that is trained on stereo image pairs. We propose a student-teacher strategy in which a shallow student network is trained with auxiliary information obtained from a deeper and more accurate teacher network. Specifically, we first train the stereo teacher network by fully utilizing the binocular perception of 3-D geometry, and then use the teacher network's depth predictions to train the student network for monocular depth inference. This enables us to exploit all available depth data from massive unlabeled stereo pairs. We use a data ensemble that merges multiple depth predictions of the teacher network, improving the training samples by collecting non-trivial knowledge beyond a single prediction. To cope with the inaccurate depth estimates used when training the student network, we further propose a stereo confidence-guided regression loss that handles unreliable pseudo depth values in occlusions, texture-less regions, and repetitive patterns. To complement the existing dataset comprising outdoor driving scenes, we built a novel large-scale dataset consisting of one million outdoor stereo images taken with hand-held stereo cameras. Finally, we demonstrate that the monocular depth estimation network provides feature representations that are suitable for high-level vision tasks. Experimental results on various outdoor scenarios demonstrate the effectiveness and flexibility of our approach, which outperforms state-of-the-art approaches.
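The two training ingredients named in the abstract, the data ensemble over teacher predictions and the confidence-guided regression loss, can be illustrated with a minimal NumPy sketch. The abstract does not specify either formula, so everything here is an assumption made for illustration: the ensemble is taken to be a per-pixel mean, the loss is taken to be an L1 error averaged over pixels whose confidence exceeds a hard threshold, and the function names `ensemble_teacher_depths` and `confidence_guided_l1` are hypothetical.

```python
import numpy as np


def ensemble_teacher_depths(predictions):
    """Merge several teacher depth maps (each H x W) into one pseudo label.

    Assumption: a per-pixel mean. The abstract does not say how the
    multiple teacher predictions are merged.
    """
    stacked = np.stack(predictions, axis=0)  # shape (K, H, W)
    return stacked.mean(axis=0)


def confidence_guided_l1(student_depth, pseudo_depth, confidence, threshold=0.5):
    """L1 regression loss that ignores low-confidence pseudo depth values.

    Assumption: `confidence` lies in [0, 1] and is hard-thresholded; the
    threshold value 0.5 is illustrative, not taken from the paper.
    """
    mask = (confidence >= threshold).astype(np.float64)
    abs_err = np.abs(student_depth - pseudo_depth)
    n_valid = mask.sum()
    if n_valid == 0:
        # No reliable pixel: contribute nothing to the loss.
        return 0.0
    return float((mask * abs_err).sum() / n_valid)


# Toy 2 x 2 example with two teacher predictions.
teacher_preds = [
    np.array([[2.0, 4.0], [6.0, 8.0]]),
    np.array([[4.0, 4.0], [6.0, 10.0]]),
]
pseudo = ensemble_teacher_depths(teacher_preds)  # [[3, 4], [6, 9]]

# The bottom-right pixel stands in for an occluded, unreliable region.
confidence = np.array([[0.9, 0.8], [0.7, 0.1]])
student = np.array([[3.5, 4.0], [5.0, 1.0]])

loss = confidence_guided_l1(student, pseudo, confidence)
print(loss)  # 0.5
```

In the toy example the student's large error at the low-confidence pixel (8.0) is excluded, so the loss is the mean of the remaining three errors (0.5, 0.0, 1.0). A soft weighting by the confidence value itself would be an equally plausible reading of "confidence-guided".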