LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. Popular LiDAR-only methods suffer from inferior segmentation of small and distant objects due to sparse laser points, while robust multi-modal solutions remain under-explored. We investigate three crucial inherent difficulties of multi-modal segmentation: modality heterogeneity, the limited intersection of sensor fields of view, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion (GF-Phase), cross-modal feature completion, and semantic-based feature fusion (SF-Phase) on all visible points. Multi-modal data augmentation is reinvigorated by applying asymmetric transformations to the LiDAR point cloud and the multi-camera images individually, which benefits model training through more diverse augmentations. MSeg3D achieves state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets. Under malfunctioning multi-camera input and multi-frame point cloud input, MSeg3D remains robust and still improves over the LiDAR-only baseline. Our code is publicly available at \url{https://github.com/jialeli1/lidarseg3d}.
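As a rough illustration of the geometry-based fusion step (GF-Phase), the sketch below projects LiDAR points into a single camera feature map and gathers per-point image features; it is a minimal assumption-laden example, not the released lidarseg3d implementation, and the function name and tensor layout are hypothetical.

```python
# Minimal sketch of geometry-based point-to-pixel feature gathering.
# Assumptions: one camera, known lidar-to-camera extrinsics and intrinsics,
# an image feature map of shape (C, H, W). Not the authors' actual code.
import torch
import torch.nn.functional as F

def gather_camera_features(points_xyz, img_feats, lidar2cam, cam_intrinsics):
    """points_xyz: (N, 3) LiDAR points; img_feats: (C, H, W) camera features;
    lidar2cam: (4, 4) extrinsics; cam_intrinsics: (3, 3) intrinsics.
    Returns per-point camera features (N, C) and a visibility mask (N,)."""
    N = points_xyz.shape[0]
    # Transform LiDAR points into the camera frame (homogeneous coordinates).
    pts_h = torch.cat([points_xyz, points_xyz.new_ones(N, 1)], dim=1)
    pts_cam = (lidar2cam @ pts_h.t()).t()[:, :3]
    in_front = pts_cam[:, 2] > 1e-3
    # Perspective projection to pixel coordinates (u, v).
    uvw = (cam_intrinsics @ pts_cam.t()).t()
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-3)
    C, H, W = img_feats.shape
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=1)
    visible = in_front & (grid.abs() <= 1).all(dim=1)
    # Bilinearly sample a camera feature for every projected point.
    sampled = F.grid_sample(img_feats[None], grid[None, :, None, :],
                            align_corners=True)          # (1, C, N, 1)
    point_cam_feats = sampled[0, :, :, 0].t()             # (N, C)
    # Points outside the camera field of view get zeros here; MSeg3D instead
    # predicts pseudo-camera features for them (cross-modal feature completion).
    point_cam_feats[~visible] = 0.0
    return point_cam_feats, visible
```

This sketch covers only the geometric projection and sampling; the subsequent SF-Phase described in the paper fuses features at the semantic level rather than purely by geometry, which is beyond the scope of this example.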