On the Sins of Image Synthesis Loss for Self-supervised Depth Estimation

Scene depth estimation from stereo and monocular imagery is critical for extracting 3D information for downstream tasks such as scene understanding. Recently, learning-based methods for depth estimation have received much attention due to their high performance and flexibility in hardware choice. However, collecting ground truth data for supervised training of these algorithms is costly or outright impossible. This circumstance suggests a need for alternative learning approaches that do not require corresponding depth measurements. Indeed, self-supervised learning of depth estimation provides an increasingly popular alternative. It is based on the idea that observed frames can be synthesized from neighboring frames if the accurate depth of the scene is known or, in this case, estimated. We show empirically that, contrary to common belief, improvements in image synthesis do not necessitate improvement in depth estimation. Rather, optimizing for image synthesis can result in diverging performance with respect to the main prediction objective, depth. We attribute this diverging phenomenon to aleatoric uncertainties, which originate from the data. Based on our experiments on four datasets (spanning street, indoor, and medical) and five architectures (monocular and stereo), we conclude that this diverging phenomenon is independent of the dataset domain and is not mitigated by commonly used regularization techniques. To underscore the importance of this finding, we include a survey of methods which use image synthesis, totaling 127 papers over the last six years. This observed divergence has not been previously reported or studied in depth, suggesting room for future improvement of self-supervised approaches that might be affected by this finding.
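To make the image synthesis idea concrete, the following is a minimal NumPy sketch for a rectified stereo pair. It is an illustration under stated assumptions, not the method or the experiments of the paper: the function names `synthesize_left` and `photometric_l1`, the constant disparities, and the toy images are all hypothetical. It warps the right image into the left view using a disparity map and scores the result with an L1 photometric loss. On a textureless image, a wrong disparity reaches the same synthesis loss as the correct one, which is one simple way in which image synthesis quality and depth accuracy can come apart.

```python
import numpy as np

def synthesize_left(right, disparity):
    """Warp the right image into the left view: left(x) ~ right(x - d(x)).

    Assumes a rectified pair, so the warp is horizontal only. Sampling uses
    linear interpolation and clamps positions to the image border.
    """
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity  # sample positions
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_l1(left, right, disparity):
    """Mean absolute error between the observed and the synthesized left image."""
    return float(np.mean(np.abs(left - synthesize_left(right, disparity))))

# Toy data (hypothetical): constant true disparity of 3 px, wrong guess of 5 px.
rng = np.random.default_rng(0)
true_d = np.full((8, 32), 3.0)
wrong_d = np.full((8, 32), 5.0)

# Textured scene: the synthesis loss separates the right and wrong disparity.
textured_right = rng.random((8, 32))
textured_left = synthesize_left(textured_right, true_d)
loss_textured_true = photometric_l1(textured_left, textured_right, true_d)
loss_textured_wrong = photometric_l1(textured_left, textured_right, wrong_d)

# Textureless scene: every disparity synthesizes the same image, so the loss
# cannot tell the wrong disparity from the right one.
flat_right = np.full((8, 32), 0.5)
flat_left = synthesize_left(flat_right, true_d)
loss_flat_true = photometric_l1(flat_left, flat_right, true_d)
loss_flat_wrong = photometric_l1(flat_left, flat_right, wrong_d)

print("textured:", loss_textured_true, loss_textured_wrong)
print("flat:    ", loss_flat_true, loss_flat_wrong)
```

In the flat case the disparity error is 2 px while the synthesis loss stays at its minimum, so lowering the synthesis loss further says nothing about depth in that region. Real pipelines typically add terms such as structural similarity and smoothness regularization, which this sketch omits.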
