Monocular height estimation (MHE) from remote sensing imagery holds great potential for efficiently generating 3D city models, enabling rapid response to natural disasters. Most existing works pursue higher accuracy, yet little research has explored the interpretability of MHE networks. In this paper, we aim to understand how deep neural networks predict height from a single monocular image. Towards a comprehensive understanding of MHE networks, we interpret them at three levels: 1) Neurons: unit-level dissection, probing the semantic and height selectivity of the learned internal representations; 2) Instances: object-level interpretation, studying the effects of different semantic classes, scales, and spatial contexts on height estimation; 3) Attribution: pixel-level analysis, identifying which input pixels are important for the height prediction. Building on this multi-level interpretation, we propose a disentangled latent Transformer network towards a more compact, reliable, and explainable model for monocular height estimation. Furthermore, this work introduces, for the first time, an unsupervised semantic segmentation task based on height estimation, and we construct a new dataset for joint semantic segmentation and height estimation. Our work provides novel insights for both understanding and designing MHE models.
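To make the pixel-level attribution analysis concrete, the following is a minimal sketch of one common attribution technique, gradient-based saliency, applied to a height-regression network. Everything here is illustrative: `HeightNet` is a hypothetical stand-in for an MHE backbone, and the paper's actual model and attribution method may differ.

```python
# Minimal sketch of pixel-level attribution for a monocular height
# estimation (MHE) network. Hypothetical: `HeightNet` stands in for any
# CNN/Transformer that maps a 3-channel image to a per-pixel height map.
import torch
import torch.nn as nn

class HeightNet(nn.Module):
    """Toy stand-in for an MHE backbone (image -> height map)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def saliency_for_region(model, image, mask):
    """Gradient of the mean predicted height inside `mask` w.r.t. input pixels.

    Large |gradient| marks the pixels the prediction depends on, e.g. to
    test whether a building's predicted height relies on its facade or
    shadow context rather than the rooftop itself.
    """
    image = image.clone().requires_grad_(True)
    height = model(image)                        # (N, 1, H, W) height map
    target = (height * mask).sum() / mask.sum()  # mean height over the region
    target.backward()
    return image.grad.abs().sum(dim=1)           # (N, H, W) saliency map

model = HeightNet().eval()
img = torch.rand(1, 3, 64, 64)                   # dummy aerial image patch
roi = torch.zeros(1, 1, 64, 64)
roi[..., 20:40, 20:40] = 1.0                     # hypothetical building region
sal = saliency_for_region(model, img, roi)
print(sal.shape)                                 # torch.Size([1, 64, 64])
```

Inspecting where the saliency mass falls, inside versus outside the masked object, is one simple way to quantify how much an MHE model leans on surrounding spatial context for its height predictions.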