Robust environment perception for autonomous vehicles is a tremendous challenge, which makes a diverse sensor set of, e.g., camera, lidar and radar crucial. In the process of understanding the recorded sensor data, 3D semantic segmentation plays an important role. Therefore, this work presents a pyramid-based deep fusion architecture for lidar and camera to improve 3D semantic segmentation of traffic scenes. Individual sensor backbones extract feature maps of camera images and lidar point clouds. A novel Pyramid Fusion Backbone fuses these feature maps at different scales and combines the multimodal features in a feature pyramid to compute valuable multimodal, multi-scale features. The Pyramid Fusion Head aggregates these pyramid features and further refines them in a late fusion step, incorporating the final features of the sensor backbones. The approach is evaluated on two challenging outdoor datasets, and different fusion strategies and setups are investigated. It outperforms recent range-view-based lidar approaches as well as all fusion strategies and architectures proposed so far.
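To make the fusion idea concrete, the following minimal PyTorch-style sketch illustrates how per-scale camera and lidar feature maps could be fused and then combined in a top-down feature pyramid. The module names echo the paper, but the channel sizes, the concatenation-based fusion, and the exact pyramid combination are illustrative assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusionBlock(nn.Module):
    """Fuses camera and lidar feature maps of one scale.
    Concatenation + conv is an assumed fusion operator, not the paper's exact one."""
    def __init__(self, cam_ch, lidar_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_feat, lidar_feat):
        # Both feature maps are assumed to be spatially aligned
        # (e.g. camera features projected into the lidar range view).
        return self.fuse(torch.cat([cam_feat, lidar_feat], dim=1))

class PyramidFusionBackbone(nn.Module):
    """Fuses the per-scale feature maps and combines them in a feature pyramid."""
    def __init__(self, cam_chs, lidar_chs, out_ch=128):
        super().__init__()
        self.blocks = nn.ModuleList(
            PyramidFusionBlock(c, l, out_ch) for c, l in zip(cam_chs, lidar_chs)
        )
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in cam_chs
        )

    def forward(self, cam_feats, lidar_feats):
        # Inputs are lists of feature maps, ordered from finest to coarsest scale.
        fused = [b(c, l) for b, c, l in zip(self.blocks, cam_feats, lidar_feats)]
        # Top-down combination: upsample the coarser pyramid level and add it.
        pyramid = [fused[-1]]
        for feat in reversed(fused[:-1]):
            up = F.interpolate(pyramid[0], size=feat.shape[-2:],
                               mode="bilinear", align_corners=False)
            pyramid.insert(0, feat + up)
        # Smooth each level; a fusion head would aggregate these pyramid features.
        return [s(p) for s, p in zip(self.smooth, pyramid)]
```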