Self-supervised learning has shown very promising results for monocular depth estimation. Scene structure and local details are both significant cues for high-quality depth estimation. Recent works suffer from a lack of explicit modeling of scene structure and proper handling of detail information, which leads to a performance bottleneck and blurry artefacts in predicted results. In this paper, we propose the Channel-wise Attention-based Depth Estimation Network (CADepth-Net) with two effective contributions: 1) The structure perception module employs a self-attention mechanism to capture long-range dependencies and aggregates discriminative features along the channel dimension, explicitly enhancing the perception of scene structure and yielding better scene understanding and richer feature representations. 2) The detail emphasis module re-calibrates channel-wise feature maps and selectively emphasizes informative features, aiming to highlight crucial local detail information and fuse features of different levels more efficiently, resulting in more precise and sharper depth predictions. Furthermore, extensive experiments validate the effectiveness of our method and show that our model achieves state-of-the-art results on the KITTI benchmark and the Make3D dataset.
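To make the two modules more concrete, the sketch below shows one plausible PyTorch realization: channel-wise self-attention for the structure perception module and a squeeze-and-excitation-style re-calibration for the detail emphasis module. The layer shapes, reduction ratio, and residual fusion here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch, assuming channel-wise self-attention and SE-style gating;
# the actual CADepth-Net modules may differ in details such as normalization
# and how decoder features are fused.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSelfAttention(nn.Module):
    """Structure-perception-style block: self-attention over channels to
    capture long-range dependencies and aggregate discriminative features."""
    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.view(b, c, -1)                        # (B, C, H*W)
        attn = torch.bmm(feat, feat.transpose(1, 2))   # (B, C, C) channel affinities
        attn = F.softmax(attn, dim=-1)
        out = torch.bmm(attn, feat).view(b, c, h, w)   # re-weighted channel maps
        return x + out                                 # residual fusion (assumed)


class DetailEmphasis(nn.Module):
    """Detail-emphasis-style block: re-calibrates channel-wise feature maps
    and selectively emphasizes informative channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        # Global average pooling followed by a gating MLP produces per-channel weights.
        weights = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * weights


# Usage on a dummy encoder feature map (batch 2, 64 channels, 24x80 resolution).
feats = torch.randn(2, 64, 24, 80)
feats = ChannelSelfAttention()(feats)
feats = DetailEmphasis(64)(feats)
print(feats.shape)  # torch.Size([2, 64, 24, 80])
```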