Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences. In particular, we propose three key contributions: (1) We transfer knowledge from features learned during self-supervised depth estimation to semantic segmentation, (2) we implement a strong data augmentation by blending images and labels using the geometry of the scene, and (3) we utilize the depth feature diversity as well as the level of difficulty of learning depth in a student-teacher framework to select the most useful samples to be annotated for semantic segmentation. We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains, and we achieve state-of-the-art results for semi-supervised semantic segmentation. The implementation is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.