Depth estimation has been widely studied and serves as a fundamental step of 3D perception for autonomous driving. Though significant progress has been made in monocular depth estimation in the past decades, these attempts are mainly conducted on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation, to jointly predict the depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN with transformer layers for enriched representations. Further, an Adjacent-view Attention mechanism is proposed to enable intra-view and inter-view feature propagation. The former is achieved by a self-attention module within each view, while the latter is realized by an adjacent attention module, which computes attention across cameras to exchange multi-scale representations among the surround-view feature maps. Extensive experiments show that our method achieves superior performance over existing state-of-the-art methods on both the DDAD and nuScenes datasets.
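The following is a minimal sketch, not the authors' released implementation, of the Adjacent-view Attention idea described above: tokens of each camera first attend to themselves (intra-view self-attention), then to the tokens of the two neighbouring cameras in the surround-view rig (inter-view adjacent attention). All module and parameter names here are illustrative assumptions, and the neighbouring views are assumed to form a ring around the vehicle.

```python
# Sketch (assumed, not the paper's code) of intra-view + adjacent-view attention.
import torch
import torch.nn as nn


class AdjacentViewAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # intra-view self-attention and inter-view cross-attention over the same feature dim
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, N, C) -- batch, surround-view cameras, tokens per view, channels
        B, V, N, C = feats.shape

        # 1) intra-view propagation: self-attention inside each camera view
        x = feats.reshape(B * V, N, C)
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x.reshape(B, V, N, C)

        # 2) inter-view propagation: attend to the left and right neighbouring views,
        #    assuming the V cameras are ordered as a ring around the vehicle
        left = torch.roll(x, shifts=1, dims=1)
        right = torch.roll(x, shifts=-1, dims=1)
        context = torch.cat([left, right], dim=2)  # (B, V, 2N, C)

        q = self.norm2(x).reshape(B * V, N, C)
        kv = self.norm2(context).reshape(B * V, 2 * N, C)
        x = x.reshape(B * V, N, C) + self.cross_attn(q, kv, kv)[0]
        return x.reshape(B, V, N, C)


if __name__ == "__main__":
    # Usage: 6 surround-view cameras, 16x16 tokens per view, 256-d features
    module = AdjacentViewAttention(dim=256)
    tokens = torch.randn(2, 6, 16 * 16, 256)
    print(module(tokens).shape)  # torch.Size([2, 6, 256, 256])
```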