Self-supervised depth estimation has achieved great success in learning depth from unlabeled image sequences. While the mapping between an image and its pixel-wise depth is well studied in current methods, the correlation among image, depth, and scene semantics has received less attention. This hinders the network from better understanding the real geometry of the scene, since contextual clues contribute not only latent representations of scene depth but also direct constraints on the depth map. In this paper, we leverage both benefits by proposing implicit and explicit semantic guidance for accurate self-supervised depth estimation. We propose a Semantic-aware Spatial Feature Alignment (SSFA) scheme that effectively aligns implicit semantic features with depth features for scene-aware depth estimation. We also propose a semantic-guided ranking loss that explicitly constrains the estimated depth maps to be consistent with the real contextual properties of the scene. Both semantic label noise and prediction uncertainty are considered in order to yield reliable depth supervision. Extensive experimental results show that our method produces high-quality depth maps that are consistently superior on both complex scenes and diverse semantic categories, and that it outperforms state-of-the-art methods by a significant margin.