Monocular depth estimation plays a critical role in computer vision and robotics applications such as localization, mapping, and 3D object detection. Learning-based algorithms have recently achieved great success in depth estimation by training models on large amounts of data in a supervised manner. However, dense ground-truth depth labels for supervised training are difficult to acquire, and unsupervised depth estimation from monocular sequences has emerged as a promising alternative. Most studies on unsupervised depth estimation focus on loss functions or occlusion masks, while model architecture has changed little: the ConvNet-based encoder-decoder structure has become the de facto standard for depth estimation. In this paper, we employ a convolution-free Swin Transformer as the image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation. We also propose a Densely Cascaded Multi-scale Network (DCMNet) that connects every feature map directly with feature maps from different scales via a top-down cascade pathway. This densely cascaded connectivity reinforces the interconnection between decoding layers and produces high-quality multi-scale depth outputs. Experiments on two datasets, KITTI and Make3D, demonstrate that the proposed method outperforms existing state-of-the-art unsupervised algorithms.
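To make the idea of a densely cascaded top-down pathway concrete, the following is a minimal sketch in plain Python, not the DCMNet implementation. It assumes that "densely cascaded" means each decoding level receives the upsampled outputs of all coarser levels, that adjacent scales differ by a factor of two, and that features are fused by simple averaging. The function names `upsample_nearest` and `dense_top_down` are hypothetical, and single-channel nested lists stand in for real feature tensors.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map as rows of floats


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor."""
    out: Grid = []
    for row in grid:
        # Repeat each value `factor` times along the width...
        wide = [v for v in row for _ in range(factor)]
        # ...and repeat the widened row `factor` times along the height.
        out.extend([list(wide) for _ in range(factor)])
    return out


def dense_top_down(features: List[Grid]) -> List[Grid]:
    """Fuse multi-scale features along a dense top-down pathway.

    features[0] is the finest scale; each following level is assumed to
    have half the resolution of the previous one. The fused map at level i
    averages the input feature at level i with the fused maps of ALL
    coarser levels, each upsampled to level i's resolution. Averaging is
    an illustrative choice, not the fusion used in the paper.
    """
    n = len(features)
    fused: List[Grid] = [[] for _ in range(n)]
    for i in range(n - 1, -1, -1):  # process from coarsest to finest
        sources = [features[i]]
        for j in range(i + 1, n):
            sources.append(upsample_nearest(fused[j], 2 ** (j - i)))
        height, width = len(features[i]), len(features[i][0])
        fused[i] = [
            [sum(s[r][c] for s in sources) / len(sources) for c in range(width)]
            for r in range(height)
        ]
    return fused


# Usage: three scales (4x4, 2x2, 1x1) yield three fused maps, one per scale.
pyramid = [
    [[2.0] * 4 for _ in range(4)],
    [[2.0] * 2 for _ in range(2)],
    [[4.0]],
]
outputs = dense_top_down(pyramid)
```

In a real network, each fused map would feed a depth prediction head at its own scale, and the averaging step would typically be replaced by learned layers operating on multi-channel tensors.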