Leveraging multi-modal fusion, especially between camera and LiDAR, has become essential for building accurate and robust 3D object detection systems for autonomous vehicles. Until recently, point-decorating approaches, in which point clouds are augmented with camera features, have dominated the field. However, these approaches fail to exploit the higher resolution of camera images. More recently, methods that project camera features into the bird's-eye-view (BEV) space for fusion have been proposed; however, they require projecting millions of pixels, most of which contain only background information. In this work, we propose a novel approach, Center Feature Fusion (CFF), in which we leverage center-based detection networks in both the camera and LiDAR streams to identify relevant object locations. We then use the center-based detections to select the pixel features relevant to these object locations, a small fraction of the total number in the image. These features are then projected into the BEV frame and fused there. On the nuScenes dataset, we outperform the LiDAR-only baseline by 4.9% mAP while fusing up to 100x fewer features than other fusion methods.
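To make the selection-and-projection idea concrete, the sketch below shows one plausible way to pick center pixels from a camera heatmap, gather their features, and lift them into the LiDAR/BEV frame. This is a minimal illustration under assumed inputs (a per-pixel center heatmap, a camera feature map, per-center depth estimates, and known intrinsics/extrinsics); the function and variable names are hypothetical and not the authors' implementation.

```python
# Hypothetical sketch of selecting center features and projecting them to BEV.
# Assumes a camera branch that outputs a center heatmap and a feature map,
# plus known camera intrinsics and a depth estimate per selected center.
import torch

def select_center_features(heatmap, img_feats, k=100):
    """Pick the top-k center pixels and gather their features.

    heatmap:   (H, W)    predicted object-center scores
    img_feats: (C, H, W) camera feature map at the same resolution
    Returns (k, C) features and (k, 2) pixel coordinates (u, v).
    """
    H, W = heatmap.shape
    scores, idx = heatmap.flatten().topk(k)        # k strongest center responses
    v, u = idx // W, idx % W                       # recover pixel coordinates
    feats = img_feats[:, v, u].T                   # gather features -> (k, C)
    return feats, torch.stack([u, v], dim=1).float()

def project_to_bev(uv, depth, intrinsics, cam_to_lidar):
    """Lift selected pixels to 3D with estimated depths, then transform
    them into the LiDAR frame and keep the BEV (x, y) coordinates.

    uv:           (k, 2) pixel coordinates
    depth:        (k,)   estimated metric depth per selected center
    intrinsics:   (3, 3) camera matrix K
    cam_to_lidar: (4, 4) extrinsic transform from camera to LiDAR frame
    """
    ones = torch.ones(uv.shape[0], 1)
    pix = torch.cat([uv, ones], dim=1) * depth[:, None]  # scaled homogeneous pixels
    cam_pts = (intrinsics.inverse() @ pix.T).T            # (k, 3) camera-frame points
    cam_h = torch.cat([cam_pts, ones], dim=1)             # homogeneous coordinates
    lidar_pts = (cam_to_lidar @ cam_h.T).T[:, :3]         # (k, 3) in LiDAR frame
    return lidar_pts[:, :2]                                # (x, y) for the BEV grid
```

The point of the sketch is only that k (e.g. 100) pixel features are projected rather than every pixel in the feature map; how the per-center depth is obtained (from the LiDAR branch or a learned depth head) is an assumption here, not specified by the abstract.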