Depth estimation and semantic segmentation play essential roles in scene understanding. State-of-the-art methods employ multi-task learning to learn models for these two tasks simultaneously at the pixel level. They usually focus on sharing common features or stitching feature maps from the corresponding branches. However, these methods lack in-depth consideration of the correlation between geometric cues and scene parsing. In this paper, we first introduce the concept of semantic objectness to exploit the geometric relationship between these two tasks through an analysis of the imaging process, and then propose a Semantic Object Segmentation and Depth Estimation Network (SOSD-Net) based on the objectness assumption. To the best of our knowledge, SOSD-Net is the first network that exploits the geometry constraint for simultaneous monocular depth estimation and semantic segmentation. In addition, considering the mutual implicit relationship between these two tasks, we adopt the iterative idea of the expectation-maximization algorithm to train the proposed network more effectively. Extensive experimental results on the Cityscapes and NYU v2 datasets demonstrate the superior performance of the proposed approach.