Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors such as LiDAR. Temporal photometric constraints enable self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict depth from each monocular image alone and ignore the correlations among the multiple surrounding cameras that are typically available on modern self-driving vehicles. In this paper, we propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all surrounding views and propose a cross-view transformer to effectively fuse information from multiple views. We apply cross-view self-attention to efficiently enable global interactions between multi-camera feature maps. Unlike self-supervised monocular depth estimation, we are able to predict real-world scales given the multi-camera extrinsic matrices. To achieve this, we adopt two-frame structure-from-motion to extract scale-aware pseudo depths for pretraining the models. Furthermore, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view ego-motion consistency. In experiments, our method achieves state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.
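As an illustration of the cross-view interaction described above, the sketch below fuses multi-camera feature maps with a single self-attention layer applied jointly to tokens from all views. The module name, layer sizes, and single-layer design are illustrative assumptions, not the exact architecture of the paper.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Minimal sketch: tokens from all camera feature maps attend to each
    other in one global self-attention pass (assumed configuration)."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats):
        # feats: (B, N_cam, C, H, W) feature maps from the surrounding cameras
        b, n, c, h, w = feats.shape
        # Flatten all cameras and spatial positions into one token sequence
        tokens = feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        fused, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + fused)
        # Restore the per-camera feature-map layout
        return tokens.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)

Attending over the flattened tokens of all views lets each camera's features draw on overlapping regions seen by its neighbors, which is the interaction the cross-view transformer is meant to provide.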
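To make the multi-view ego-motion consistency concrete, the following minimal NumPy sketch transfers a single vehicle motion to every camera through its extrinsic matrix. The function name and the camera-to-vehicle convention for the extrinsics are assumptions for illustration.

import numpy as np

def camera_ego_motion(T_vehicle, extrinsics):
    """Transfer one vehicle ego-motion to all cameras.

    T_vehicle : (4, 4) rigid transform of the vehicle between two frames,
                expressed in the vehicle coordinate system.
    extrinsics: list of (4, 4) camera-to-vehicle transforms E_i (assumed convention).

    Returns the per-camera ego-motion T_i = E_i^{-1} @ T_vehicle @ E_i,
    so all views share one consistent motion.
    """
    return [np.linalg.inv(E) @ T_vehicle @ E for E in extrinsics]

Because every per-camera motion is derived from the same vehicle transform, the photometric losses of all views are supervised by a mutually consistent ego-motion rather than independently estimated ones.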