Although existing monocular depth estimation methods have made great progress, predicting an accurate absolute depth map from a single image remains challenging due to the limited modeling capacity of networks and the scale ambiguity issue. In this paper, we introduce a fully Visual Attention-based Depth (VADepth) network, in which spatial attention and channel attention are applied at all stages. By continuously extracting long-range dependencies of features along the spatial and channel dimensions, the VADepth network can effectively preserve important details and suppress interfering features, perceiving the scene structure better and producing more accurate depth estimates. In addition, we utilize geometric priors to impose scale constraints for scale-aware model training. Specifically, we construct a novel scale-aware loss using the distance between the camera and a plane fitted to the ground points corresponding to the pixels of a rectangular area at the bottom center of the image. Experimental results on the KITTI dataset show that this architecture achieves state-of-the-art performance, and our method can directly output absolute depth without post-processing. Moreover, our experiments on the SeasonDepth dataset also demonstrate the robustness of our model to multiple unseen environments.
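To make the scale-aware loss concrete, the following is a minimal sketch, not the authors' implementation, of one way to compute a camera-to-ground-plane distance term. It assumes a pinhole camera model with known intrinsics, a known camera height prior (e.g., roughly 1.65 m for the KITTI setup), and hypothetical names such as `scale_aware_loss` and `roi`; the plane fit here uses a simple SVD least-squares fit.

```python
# Hypothetical sketch (not the authors' code) of a scale-aware loss built from
# the distance between the camera and a plane fitted to ground points taken
# from a bottom-center rectangle of the image.
import torch

def backproject(depth, K_inv, us, vs):
    """Lift pixels (us, vs) with predicted depth to 3D camera coordinates."""
    ones = torch.ones_like(us)
    pix = torch.stack([us, vs, ones], dim=0).float()   # (3, N) homogeneous pixels
    rays = K_inv @ pix                                  # (3, N) viewing rays
    return rays * depth.view(1, -1)                     # (3, N) 3D points

def scale_aware_loss(depth_map, K_inv, cam_height_prior, roi):
    """Fit a plane to ground points inside the bottom-center ROI and compare
    the camera-to-plane distance with the known camera height prior."""
    v0, v1, u0, u1 = roi                                # bottom-center rectangle
    vs, us = torch.meshgrid(
        torch.arange(v0, v1), torch.arange(u0, u1), indexing="ij")
    depth = depth_map[vs, us]
    pts = backproject(depth, K_inv, us.reshape(-1), vs.reshape(-1)).T  # (N, 3)

    # Least-squares plane n.x + d = 0 with ||n|| = 1 via SVD of centered points.
    centroid = pts.mean(dim=0)
    _, _, Vh = torch.linalg.svd(pts - centroid)
    normal = Vh[-1]                                     # direction of smallest variance
    d = -(normal @ centroid)

    # Distance from the camera origin (0, 0, 0) to the fitted plane is |d|
    # since the normal has unit length; penalize deviation from the prior.
    cam_to_plane = torch.abs(d)
    return torch.abs(cam_to_plane - cam_height_prior)
```

Because the fitted camera height scales linearly with the predicted depth, penalizing its deviation from a metric prior constrains the global scale of the depth map, which is one plausible way to obtain absolute depth without post-processing such as median scaling.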