Monocular depth estimation and semantic segmentation are two fundamental goals of scene understanding. Because the two tasks benefit from interacting with each other, many works study joint task learning algorithms. However, most existing methods fail to fully leverage the semantic labels: they ignore the context structures the labels provide and use them only to supervise the prediction of the segmentation branch. In this paper, we propose a network injected with contextual information (CI-Net) to solve this problem. Specifically, we introduce a self-attention block in the encoder to generate an attention map. With supervision from ground truth created from the semantic labels, the network is embedded with contextual information, so that it can understand the scene better and use dependent features to make accurate predictions. In addition, a feature sharing module is constructed to deeply fuse the task-specific features, and a consistency loss is devised to make the features guide each other. We evaluate the proposed CI-Net on the NYU-Depth-v2 and SUN-RGBD datasets. The experimental results show that CI-Net is competitive with state-of-the-art methods.
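To illustrate how an attention map could be supervised by ground truth created from semantic labels, the following is a minimal sketch and not the CI-Net implementation. It assumes the target is a pairwise same-class affinity matrix (entry (i, j) is 1 when pixels i and j share a semantic label) and that the loss is a mean binary cross-entropy; the abstract specifies neither choice. The names `affinity_from_labels` and `attention_bce` are hypothetical.

```python
import math


def affinity_from_labels(labels):
    """Build an N x N 0/1 target from a 2-D label map (N = number of pixels).

    Assumption: entry (i, j) is 1.0 when pixels i and j share a class.
    """
    flat = [c for row in labels for c in row]
    return [[1.0 if a == b else 0.0 for b in flat] for a in flat]


def attention_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted attention map with
    values in [0, 1] and the 0/1 target (assumed loss, for illustration)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# Toy 2 x 2 label map: the two top pixels share class 0.
labels = [[0, 0], [1, 2]]
target = affinity_from_labels(labels)

# An uninformative attention map that assigns 0.5 to every pixel pair.
uniform = [[0.5] * 4 for _ in range(4)]
loss_uniform = attention_bce(uniform, target)
loss_perfect = attention_bce(target, target)
```

In this toy case the uninformative map is penalized while a map that matches the label-derived affinity incurs almost no loss, which is the sense in which the labels inject context into the attention.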