Multi-camera 3D object detection for autonomous driving is a challenging problem that has attracted notable attention from both academia and industry. A key obstacle in vision-based techniques is the accurate extraction of geometry-aware features from RGB images. Recent approaches acquire spatial information by pretraining geometry-aware image backbones on depth-relevant tasks. However, these approaches overlook the critical aspect of view transformation, and the resulting misalignment of spatial knowledge between the image backbone and the view transformation leads to inadequate performance. To address this issue, we propose GAPretrain, a novel geometry-aware pretraining framework. Our approach injects spatial and structural cues into camera networks by employing a geometry-rich modality as guidance during the pretraining phase. Transferring modality-specific attributes across modalities is non-trivial; we bridge this gap with a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds to facilitate the pretraining process. GAPretrain is a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. With BEVFormer, we achieve 46.2 mAP and 55.5 NDS on the nuScenes val set, gains of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
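The abstract does not spell out the pretraining mechanics, but the core idea, aligning camera-derived BEV features to LiDAR-derived BEV features with structural (foreground) cues, can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' GAPretrain implementation: the `ToyBEVEncoder` modules, the grid size, and the occupancy-based foreground mask are all hypothetical stand-ins.

```python
# Minimal sketch (PyTorch) of LiDAR-guided BEV pretraining, assuming two toy
# branches that map their inputs onto a shared BEV grid. The camera branch
# (student) is trained to match the frozen LiDAR branch (teacher), with the
# loss weighted toward occupied cells as a stand-in for "structural hints".
import torch
import torch.nn as nn
import torch.nn.functional as F

BEV_H, BEV_W, C = 32, 32, 64  # hypothetical BEV grid size and channel dim

class ToyBEVEncoder(nn.Module):
    """Stand-in for a real camera/LiDAR branch producing BEV features."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, C, 3, padding=1), nn.ReLU(),
            nn.Conv2d(C, C, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (B, C, BEV_H, BEV_W)

camera_branch = ToyBEVEncoder(in_ch=3)  # trainable student
lidar_branch = ToyBEVEncoder(in_ch=1)   # teacher, kept frozen
for p in lidar_branch.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(camera_branch.parameters(), lr=2e-4)

# Fake batch: camera-view features and a rasterized LiDAR BEV occupancy map.
cam_in = torch.randn(2, 3, BEV_H, BEV_W)
lidar_in = torch.rand(2, 1, BEV_H, BEV_W)

cam_bev = camera_branch(cam_in)
with torch.no_grad():
    lidar_bev = lidar_branch(lidar_in)
    # Foreground mask from occupancy: emphasize cells containing structure.
    fg_mask = (lidar_in > 0.5).float()

# Distillation loss: match camera BEV features to LiDAR BEV features,
# weighted by the foreground mask, then update the camera branch.
loss = (F.mse_loss(cam_bev, lidar_bev, reduction="none") * fg_mask).mean()
loss.backward()
optimizer.step()
```

In this sketch the pretrained camera branch would subsequently be fine-tuned inside a full detector (e.g., BEVFormer), which is consistent with the plug-and-play usage the abstract describes.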