Multi-camera 3D object detection for autonomous driving is a challenging problem that has garnered notable attention from both academia and industry. A key obstacle for vision-based techniques is precisely extracting geometry-aware features from RGB images. Recent approaches have utilized geometry-aware image backbones pretrained on depth-relevant tasks to acquire spatial information. However, these approaches overlook the critical aspect of view transformation, resulting in inadequate performance due to the misalignment of spatial knowledge between the image backbone and the view transformation. To address this issue, we propose a novel geometry-aware pretraining framework called GAPretrain. Our approach injects spatial and structural cues into camera networks by employing a geometry-rich modality as guidance during the pretraining phase. Transferring modality-specific attributes across different modalities is non-trivial, but we bridge this gap by using a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds to facilitate the pretraining process. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set with the BEVFormer method, a gain of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
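To make the core idea concrete, below is a minimal sketch of what LiDAR-guided BEV pretraining can look like: a frozen LiDAR branch supplies geometry-rich BEV features that supervise the camera-derived BEV features through a feature-imitation loss, optionally weighted by foreground hints from the point cloud. This is not the authors' implementation; all module names, tensor shapes, and the specific loss form are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's code): BEV-level feature
# imitation where detached LiDAR BEV features act as the geometry-rich
# teacher for camera BEV features during pretraining.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVFeatureImitation(nn.Module):
    """Aligns camera BEV features to LiDAR BEV features with an L2 loss."""

    def __init__(self, cam_channels: int, lidar_channels: int):
        super().__init__()
        # 1x1 conv projects camera BEV features into the LiDAR feature space.
        self.adapter = nn.Conv2d(cam_channels, lidar_channels, kernel_size=1)

    def forward(
        self,
        cam_bev: torch.Tensor,          # (B, C_cam, H, W) camera BEV features (student)
        lidar_bev: torch.Tensor,        # (B, C_lid, H, W) LiDAR BEV features (teacher)
        fg_mask: Optional[torch.Tensor] = None,  # (B, 1, H, W) optional foreground weights
    ) -> torch.Tensor:
        pred = self.adapter(cam_bev)
        # Teacher features are detached so gradients only flow into the camera branch.
        loss = F.mse_loss(pred, lidar_bev.detach(), reduction="none")
        if fg_mask is not None:
            # Emphasize object regions indicated by structural hints from the point cloud.
            loss = loss * fg_mask
        return loss.mean()


if __name__ == "__main__":
    crit = BEVFeatureImitation(cam_channels=256, lidar_channels=384)
    cam = torch.randn(2, 256, 200, 200)
    lid = torch.randn(2, 384, 200, 200)
    mask = (torch.rand(2, 1, 200, 200) > 0.8).float()
    print(crit(cam, lid, mask).item())  # scalar pretraining loss
```

Because the supervision operates on the unified BEV representation rather than on image-plane features, a loss of this kind can be attached to different camera detectors without modifying their backbones or view-transformation modules, which is consistent with the plug-and-play claim above.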