LiDAR point clouds have become the most common data source in autonomous driving. However, due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in certain scenarios. Because of their complementarity with point clouds, images are receiving increasing attention. Despite some success, existing fusion methods either perform hard fusion or do not fuse in a direct manner. In this paper, we propose a generic 3D detection framework called MMFusion that uses multi-modal features. The framework aims to achieve accurate fusion between LiDAR and images to improve 3D detection in complex scenes. Our framework consists of two separate streams: the LiDAR stream and the camera stream, which are compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and the Multi-modal Feature Fusion Module then selectively combines the features output by the two streams to achieve better fusion. Extensive experiments show that our framework not only outperforms existing benchmarks but also improves their detection, especially for cyclists and pedestrians on the KITTI benchmark, with strong robustness and generalization capability. We hope our work will stimulate more research into multi-modal fusion for autonomous driving tasks.
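To make the two-stream design concrete, the following is a minimal PyTorch sketch of how a LiDAR stream refined by a local perception module and a camera stream could be selectively fused before a 3D detection head. The internals shown here (channel sizes, the residual MLP refinement, the gated fusion, and all class and variable names) are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Illustrative sketch only: module internals are assumptions, not the MMFusion code.
import torch
import torch.nn as nn


class VoxelLocalPerception(nn.Module):
    """Assumed stand-in for the Voxel Local Perception Module: refines per-voxel
    LiDAR features with a small point-wise MLP to strengthen local representation."""
    def __init__(self, channels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (N_voxels, C); residual refinement of local features
        return voxel_feats + self.mlp(voxel_feats)


class MultiModalFeatureFusion(nn.Module):
    """Assumed selective fusion: a learned gate weighs LiDAR vs. image features
    element-wise before producing the joint representation."""
    def __init__(self, lidar_dim: int, image_dim: int, out_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lidar_dim + image_dim, out_dim), nn.Sigmoid())
        self.proj_lidar = nn.Linear(lidar_dim, out_dim)
        self.proj_image = nn.Linear(image_dim, out_dim)

    def forward(self, lidar_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # lidar_feats: (N, C_l), image_feats: (N, C_i), already aligned per voxel
        g = self.gate(torch.cat([lidar_feats, image_feats], dim=-1))
        return g * self.proj_lidar(lidar_feats) + (1.0 - g) * self.proj_image(image_feats)


if __name__ == "__main__":
    vlp = VoxelLocalPerception(64)
    fusion = MultiModalFeatureFusion(lidar_dim=64, image_dim=128, out_dim=128)
    lidar = vlp(torch.randn(1024, 64))    # refined LiDAR-stream features
    image = torch.randn(1024, 128)        # camera-stream features sampled per voxel
    fused = fusion(lidar, image)          # (1024, 128) fused features for the 3D head
    print(fused.shape)
```

Because both streams only need to produce per-voxel feature tensors, either backbone can be swapped for any single-modal feature extractor, which is the compatibility property the abstract emphasizes.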