Multi-view camera-based 3D object detection has gained popularity due to its low cost. However, accurately inferring 3D geometry solely from camera data remains challenging and limits model performance. One promising way to address this issue is to distill precise 3D geometry knowledge from LiDAR data, yet transferring knowledge across sensor modalities is hindered by the significant modality gap. In this paper, we tackle this challenge from the perspectives of both architecture design and knowledge distillation, and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework comprising a LiDAR-camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features from image-only input. To enable effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously. By combining these components, BEVSimDet learns better feature representations for 3D object detection while enjoying cost-effective, camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness and superiority of BEVSimDet over recent representative methods. The source code will be released.
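To make the three-part distillation scheme concrete, the sketch below shows one plausible way the intra-modal, cross-modal, and multi-modal losses could be combined over BEV features. This is a minimal illustration under our own assumptions, not the paper's implementation: the function name, the MSE objective, the concatenation-based fusion stand-in, and the unit loss weights are all hypothetical.

```python
# A minimal sketch of simulated multi-modal distillation, NOT the authors'
# implementation. Feature shapes, the MSE objective, the concatenation fusion,
# and the equal loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def simulated_multimodal_distillation(t_lidar, t_cam, s_sim_lidar, s_cam):
    """Combine intra-modal, cross-modal, and multi-modal feature distillation.

    t_lidar, t_cam:     teacher BEV features from the LiDAR / camera branches.
    s_sim_lidar, s_cam: student BEV features, where s_sim_lidar is simulated
                        from image-only input.
    All tensors are assumed to share shape (B, C, H, W).
    """
    # Intra-modal: student camera branch mimics the teacher camera branch.
    intra = F.mse_loss(s_cam, t_cam)
    # Cross-modal: the student's simulated LiDAR branch mimics the teacher
    # LiDAR branch, transferring geometry knowledge across the modality gap.
    cross = F.mse_loss(s_sim_lidar, t_lidar)
    # Multi-modal: fused student features mimic fused teacher features
    # (channel concatenation used here as a stand-in for the paper's fusion).
    multi = F.mse_loss(torch.cat([s_sim_lidar, s_cam], dim=1),
                       torch.cat([t_lidar, t_cam], dim=1))
    return intra + cross + multi
```

In this reading, the teacher is frozen and only the student receives gradients, so at deployment time the student runs from cameras alone while its simulated LiDAR branch stands in for the missing sensor.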