Recent studies have shown that combining LiDAR and camera modalities is both typical and necessary for 3D object detection. In essence, however, existing fusion strategies rely too heavily on the LiDAR modality and therefore make insufficient use of the rich semantics provided by the camera sensor. As a result, when LiDAR features are corrupted, these methods cannot fall back on information from the other modality, because the corruption introduces a large domain gap. To address this, we propose CrossFusion, a more robust and noise-resistant scheme that makes full use of camera and LiDAR features through a designed cross-modal complementation strategy. Extensive experiments show that our method not only outperforms state-of-the-art methods in the setting without an extra depth estimation network, but also demonstrates noise resistance in specific malfunction scenarios without re-training, improving mAP by 5.2\% and NDS by 2.4\%.