Depth estimation from a single image is an important task with applications across many areas of computer vision, and research on it has advanced rapidly with the development of convolutional neural networks. In this paper, we propose a novel architecture and training strategy for monocular depth estimation to further improve prediction accuracy. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder that generates the estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder outperforms previously proposed decoders at considerably lower computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance on the challenging NYU Depth V2 dataset. Extensive experiments validate the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than the compared models.
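To illustrate how local and global representations might be integrated, the following is a minimal sketch of an attention-weighted fusion step. It is not the module as implemented in this work: the abstract does not specify its internals, so the gating scheme (two per-pixel sigmoid attention weights computed from the concatenated features), the function name `selective_fusion`, and the weight vectors `w_local` and `w_global` (standing in for learned 1x1 projections) are all assumptions made for illustration. Feature maps are written as plain nested lists of shape channels x pixels to keep the example dependency-free.

```python
import math
from typing import List, Sequence

# A feature map flattened to [channel][pixel], i.e. C x (H*W).
Feature = List[List[float]]


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def selective_fusion(
    local_feat: Feature,
    global_feat: Feature,
    w_local: Sequence[float],
    w_global: Sequence[float],
) -> Feature:
    """Fuse local and global features with per-pixel attention weights.

    ASSUMPTION: w_local and w_global are hypothetical stand-ins for learned
    1x1 projections over the concatenated (2*C) features at each pixel.
    """
    if len(local_feat) != len(global_feat) or any(
        len(a) != len(b) for a, b in zip(local_feat, global_feat)
    ):
        raise ValueError("local and global features must have the same shape")

    channels = len(local_feat)
    if len(w_local) != 2 * channels or len(w_global) != 2 * channels:
        raise ValueError("each weight vector needs 2 * channels entries")

    pixels = len(local_feat[0]) if channels else 0
    fused = [[0.0] * pixels for _ in range(channels)]

    for p in range(pixels):
        # Concatenate both representations at this pixel.
        stacked = [local_feat[c][p] for c in range(channels)]
        stacked += [global_feat[c][p] for c in range(channels)]

        # One attention weight per branch, each in (0, 1).
        a_local = _sigmoid(sum(w * v for w, v in zip(w_local, stacked)))
        a_global = _sigmoid(sum(w * v for w, v in zip(w_global, stacked)))

        # Weighted sum of the two branches, channel by channel.
        for c in range(channels):
            fused[c][p] = a_local * local_feat[c][p] + a_global * global_feat[c][p]

    return fused
```

With all-zero weights both attention values equal 0.5, so the output reduces to the plain average of the two inputs; non-zero weights let each pixel lean toward the local or the global branch.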