Monocular depth estimation is an important task that can be applied to many robotic applications. Existing methods focus on improving depth estimation accuracy by training increasingly deeper and wider networks, which suffer from high computational complexity. Recent studies have found that edge information provides important cues for convolutional neural networks (CNNs) to estimate depth. Inspired by these observations, we present a novel lightweight Edge Guided Depth Estimation Network (EGD-Net) in this study. In particular, we start from a lightweight encoder-decoder architecture and embed an edge guidance branch that takes image gradients and multi-scale feature maps from the backbone as input to learn edge attention features. To aggregate the context information and the edge attention features, we design a transformer-based feature aggregation module (TRFA). TRFA captures the long-range dependencies between the context information and the edge attention features through a cross-attention mechanism. We perform extensive experiments on the NYU Depth v2 dataset. Experimental results show that the proposed method runs at about 96 fps on an Nvidia GTX 1080 GPU whilst achieving state-of-the-art accuracy.
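To make the cross-attention aggregation concrete, below is a minimal PyTorch sketch of fusing context features (as queries) with edge attention features (as keys/values), assuming both branches produce feature maps of matching shape. The class name `CrossAttentionFusion` and all hyperparameters are illustrative assumptions, not the authors' TRFA implementation.

```python
# Hypothetical sketch of cross-attention feature aggregation in the spirit
# of TRFA; not the authors' code. Assumes PyTorch >= 1.9.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses context features (queries) with edge attention features
    (keys/values) via multi-head cross-attention plus a residual."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, context: torch.Tensor, edge: torch.Tensor) -> torch.Tensor:
        # context, edge: (B, C, H, W) feature maps of the same shape
        b, c, h, w = context.shape
        q = context.flatten(2).transpose(1, 2)   # (B, HW, C) queries
        kv = edge.flatten(2).transpose(1, 2)     # (B, HW, C) keys/values
        fused, _ = self.attn(q, kv, kv)          # cross-attention over all positions
        fused = self.norm(fused + q)             # residual connection + LayerNorm
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Usage: fuse 64-channel context and edge features at a reduced resolution
# (all sizes are illustrative).
fusion = CrossAttentionFusion(channels=64)
ctx = torch.randn(1, 64, 30, 40)
edg = torch.randn(1, 64, 30, 40)
out = fusion(ctx, edg)   # -> (1, 64, 30, 40)
```

Flattening the spatial grid into a token sequence lets every context position attend to every edge position, which is how a cross-attention module captures long-range dependencies between the two feature streams.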