Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
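To make the view-transformation bottleneck concrete, the snippet below is a minimal sketch of the BEV pooling operation the abstract refers to: camera features that have been lifted into 3D are aggregated into a dense bird's-eye-view grid by summing all features that fall into the same cell. The function name `bev_pool` and the tensor shapes are illustrative assumptions; the paper's reported >40x latency reduction comes from its optimized implementation, not from this naive PyTorch version.

```python
import torch

def bev_pool(feats, coords, grid_size):
    """Sum-pool lifted camera features into a dense BEV grid.

    feats:     (N, C) point features produced by depth-based lifting
    coords:    (N, 2) integer BEV cell indices (x, y) for each point
    grid_size: (X, Y) number of cells along each BEV axis
    returns:   (C, X, Y) dense BEV feature map
    """
    C = feats.shape[1]
    X, Y = grid_size
    # Keep only points that land inside the BEV grid.
    mask = (coords[:, 0] >= 0) & (coords[:, 0] < X) & \
           (coords[:, 1] >= 0) & (coords[:, 1] < Y)
    feats, coords = feats[mask], coords[mask]
    # Flatten 2D cell indices so all points can be scattered in one pass.
    flat_idx = coords[:, 0] * Y + coords[:, 1]            # (N,)
    bev = torch.zeros(X * Y, C, dtype=feats.dtype)
    bev.index_add_(0, flat_idx, feats)                    # sum features per cell
    return bev.view(X, Y, C).permute(2, 0, 1)             # (C, X, Y)

# Example (hypothetical sizes): 1000 lifted points with 80-channel features
# pooled onto a 180 x 180 BEV grid.
feats = torch.randn(1000, 80)
coords = torch.randint(0, 180, (1000, 2))
bev_map = bev_pool(feats, coords, (180, 180))             # shape (80, 180, 180)
```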