Recently, fusing the LiDAR point cloud and camera image to improve the performance and robustness of 3D object detection has received increasing attention, as these two modalities naturally possess strong complementarity. In this paper, we propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion~(CB-Fusion) module and a Multi-Modal Consistency~(MC) loss. More concretely, the proposed CB-Fusion module enhances point features with plentiful semantic information absorbed from the image features in a cascade bi-directional interaction fusion manner, leading to more powerful and discriminative feature representations. The MC loss explicitly guarantees the consistency between predicted scores from the two modalities to obtain more comprehensive and reliable confidence scores. Experimental results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over state-of-the-art methods. In addition, we emphasize a critical but easily overlooked problem: the performance and robustness of a 3D detector in sparser scenes. Extensive experiments show that EPNet++ outperforms existing SOTA methods by remarkable margins in highly sparse point cloud cases, which suggests a promising direction for reducing the expensive cost of LiDAR sensors. Code is available at: https://github.com/happinesslz/EPNetV2.
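To make the bi-directional fusion idea concrete, the following is a minimal NumPy sketch of one fusion step: each point gathers the image feature at its (hypothetical) projection onto the image plane, and point features are in turn scattered back onto their pixels. The additive fusion, the random projection indices `uv`, and all shapes are illustrative assumptions, not the paper's actual CB-Fusion architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N, C = 6, 4          # number of points, feature channels (illustrative sizes)
H, W = 8, 8          # image feature-map height and width

point_feats = rng.normal(size=(N, C))      # per-point features
image_feats = rng.normal(size=(H, W, C))   # image feature map

# Hypothetical projection of each 3D point onto the image plane,
# given here as integer pixel coordinates (u, v).
uv = rng.integers(0, [W, H], size=(N, 2))

# Image -> point direction: each point absorbs the image feature
# located at its projected pixel (simple additive fusion for illustration).
img_at_pts = image_feats[uv[:, 1], uv[:, 0]]       # (N, C)
enhanced_pts = point_feats + img_at_pts

# Point -> image direction: scatter point features back to their pixels.
enhanced_img = image_feats.copy()
enhanced_img[uv[:, 1], uv[:, 0]] += point_feats

print(enhanced_pts.shape)  # (6, 4)
print(enhanced_img.shape)  # (8, 8, 4)
```

In the actual module this interaction is applied in a cascade across multiple feature levels, with learned fusion layers rather than plain addition.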
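The MC loss enforces agreement between the confidence scores predicted by the two modality branches. Below is a minimal sketch of such a consistency term using a mean-squared formulation; the function name `mc_loss` and the squared-error form are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def mc_loss(scores_img, scores_pts):
    """Penalize disagreement between per-object confidence scores from
    the image branch and the point branch.
    (Illustrative mean-squared formulation, not the paper's exact loss.)"""
    scores_img = np.asarray(scores_img, dtype=float)
    scores_pts = np.asarray(scores_pts, dtype=float)
    return float(np.mean((scores_img - scores_pts) ** 2))

print(mc_loss([0.9, 0.2], [0.9, 0.2]))  # identical predictions -> 0.0
print(mc_loss([0.9, 0.2], [0.1, 0.8]))  # disagreement -> 0.5
```

Minimizing such a term pushes the two branches toward consistent confidence estimates, so the final fused score is more reliable than either branch alone.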