Without ground-truth supervision, self-supervised depth estimation can become trapped in a local minimum because of the gradient-locality issue of the photometric loss. In this paper, we present a framework that improves depth estimation by using semantic segmentation to guide the network out of such local minima. Prior works share an encoder between the two tasks or explicitly align them using priors such as the consistency between edges in the depth and segmentation maps. However, these methods usually require ground truth or high-quality pseudo labels, which may not be readily available in real-world applications. In contrast, we study self-supervised depth estimation together with a segmentation branch supervised by noisy labels from models pre-trained on limited data. We extend parameter sharing from the encoder to the decoder and study how the number of shared decoder parameters affects model performance. We also propose using cross-task information to refine the current depth and segmentation predictions, generating pseudo depth and semantic labels for training. Extensive experiments on the KITTI benchmark and on a downstream task, endoscopic tissue deformation tracking, demonstrate the advantages of the proposed method.
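The gradient-locality issue mentioned above can be illustrated with a toy one-dimensional example. The sketch below is not the paper's implementation: the signals, the linear sampler, and the function names (`sample_linear`, `photometric_loss`, `loss_gradient`) are assumptions introduced only for illustration. It shows that, with linear interpolation, the gradient of an L1 photometric loss with respect to the disparity depends only on the two source pixels next to the current sampling position, so a poor initial estimate in a textureless region receives no useful gradient even though the loss is large.

```python
import numpy as np


def sample_linear(signal, x):
    """Linearly interpolate a 1-D signal at fractional position x."""
    x = float(np.clip(x, 0, len(signal) - 1 - 1e-9))
    i = int(np.floor(x))
    w = x - i
    return (1 - w) * signal[i] + w * signal[i + 1]


def photometric_loss(target, source, x, d):
    """L1 difference between target[x] and the source sampled at x + d."""
    return abs(target[x] - sample_linear(source, x + d))


def loss_gradient(target, source, x, d):
    """Analytic d(loss)/dd; it only sees the two source pixels around x + d."""
    pos = float(np.clip(x + d, 0, len(source) - 1 - 1e-9))
    i = int(np.floor(pos))
    residual = sample_linear(source, pos) - target[x]
    return np.sign(residual) * (source[i + 1] - source[i])


# Hypothetical scene: a flat source signal with one small textured bump.
source = np.zeros(32)
source[20:23] = [0.5, 1.0, 0.5]

# The target view is the source shifted by a true disparity of 10 pixels.
target = np.roll(source, -10)

x = 11  # target pixel that sees the peak of the bump

# Far from the solution: the loss is large but the gradient vanishes.
print(photometric_loss(target, source, x, 0.0), loss_gradient(target, source, x, 0.0))

# Close to the solution: the gradient is negative, so descent increases d toward 10.
print(photometric_loss(target, source, x, 9.5), loss_gradient(target, source, x, 9.5))
```

In this toy setting, an additional cue that does not rely on local intensity differences is needed to move the estimate out of the flat region; the abstract proposes semantic segmentation for this role.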