LiDAR and camera are two essential sensors for 3D object detection in autonomous driving. LiDAR provides accurate and reliable 3D geometry information, while the camera provides rich texture and color. Despite the increasing popularity of fusing these two complementary sensors, the challenge remains of how to effectively fuse 3D LiDAR point clouds with 2D camera images. Recent methods focus on either point-level fusion, which paints the LiDAR point cloud with camera features in the perspective view, or bird's-eye view (BEV)-level fusion, which unifies multi-modality features in the BEV representation. In this paper, we rethink these previous fusion strategies and analyze their information loss and influence on geometric and semantic features. We present SemanticBEVFusion, which deeply fuses camera features with LiDAR features in a unified BEV representation while maintaining per-modality strengths for 3D object detection. Our method achieves state-of-the-art performance on the large-scale nuScenes dataset, especially for challenging distant objects. The code will be made publicly available.
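To make the point-level fusion strategy mentioned above concrete, the sketch below illustrates the general "painting" idea: LiDAR points are projected into the camera image with the calibration matrices, and each point is decorated with the camera feature sampled at its projected pixel. This is a minimal illustration, not the paper's implementation; the matrix names (`lidar2cam`, `cam_intrinsic`) and the nearest-neighbor sampling are assumptions for the example.

```python
import numpy as np

def paint_points(points, cam_feats, lidar2cam, cam_intrinsic):
    """Hypothetical point-painting sketch.

    points: (N, 3) LiDAR xyz coordinates.
    cam_feats: (H, W, C) camera feature map (or RGB image).
    lidar2cam: (4, 4) LiDAR-to-camera extrinsic transform (assumed given).
    cam_intrinsic: (3, 3) camera intrinsic matrix (assumed given).
    Returns (N, 3 + C) points concatenated with sampled camera features.
    """
    # Transform points from the LiDAR frame into the camera frame.
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (lidar2cam @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 1e-3
    pts_front = pts_cam[in_front]

    # Perspective projection to pixel coordinates.
    uv = (cam_intrinsic @ pts_front.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep points that land inside the image; sample nearest-neighbor
    # features (bilinear sampling is a common alternative in practice).
    H, W, C = cam_feats.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    painted = np.zeros((points.shape[0], C), dtype=cam_feats.dtype)
    idx = np.flatnonzero(in_front)[valid]
    painted[idx] = cam_feats[v[valid], u[valid]]

    # Concatenate sampled camera features onto the raw point coordinates;
    # points outside the camera view keep zero-valued camera features.
    return np.concatenate([points, painted], axis=1)
```

A sketch like this also makes the information loss discussed in the paper visible: points occluded or outside the camera frustum receive no camera features, and the hard pixel lookup discards semantic context around each projected point.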