Multi-view camera-based 3D object detection has gained popularity due to its low cost. However, accurately inferring 3D geometry solely from camera data remains challenging, which limits model performance. One promising way to address this issue is to distill precise 3D geometry knowledge from LiDAR data. Transferring knowledge across sensor modalities, though, is hindered by the significant modality gap. In this paper, we approach this challenge from the perspectives of both architecture design and knowledge distillation, and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework consisting of a LiDAR-camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features from image-only input. To facilitate effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously. By combining these components, BEVSimDet learns better feature representations for 3D object detection while enjoying cost-effective camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness and superiority of BEVSimDet over recent representative methods. The source code will be released at \href{https://github.com/ViTAE-Transformer/BEVSimDet}{BEVSimDet}.