Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting informative features from a supplementary modality (X-modality). In this work, we propose CMX, a transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize to different sensing modalities, which bring various supplements and uncertainties, we argue that comprehensive cross-modal interactions should be provided. CMX is built with two streams that extract features from RGB images and the X-modality. At each feature extraction stage, we design a Cross-Modal Feature Rectification Module (CM-FRM) that calibrates the features of the current modality by combining them with the features of the other modality, along both the spatial and channel dimensions. With the rectified feature pairs, we deploy a Feature Fusion Module (FFM) to mix them for the final semantic prediction. FFM is built on a cross-attention mechanism, which enables the exchange of long-range contexts and enhances the bi-modal features globally. Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as on RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. In addition, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.
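The two ideas named in the abstract, rectifying one modality with the other and then fusing the pair through cross-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes flattened feature maps of shape (tokens, channels), uses random weights in place of learned ones, shows only a channel-wise gate for the rectification step, and uses standard scaled dot-product cross-attention, which may differ from the variant CMX employs. All function and parameter names (`rectify_channelwise`, `cross_attention`, `fuse`, `w_gate`, `w_out`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rectify_channelwise(current, other, w_gate):
    """Hypothetical channel-wise rectification (an assumption, not CM-FRM itself).

    current, other: (N, C) token features of the two modalities.
    w_gate: (2C, C) weights mapping pooled statistics to per-channel gates.
    """
    pooled = np.concatenate([current.mean(axis=0), other.mean(axis=0)])  # (2C,)
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_gate)))                      # (C,), in (0, 1)
    # Add the gated features of the other modality to the current one.
    return current + gate * other


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Tokens of one modality attend to all tokens of the other modality."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # (N, M); each row sums to 1
    return attn @ v, attn


def fuse(rgb_tokens, x_tokens, params):
    """Bidirectional cross-attention followed by a linear mixing layer."""
    rgb_ctx, _ = cross_attention(rgb_tokens, x_tokens, *params["rgb"])
    x_ctx, _ = cross_attention(x_tokens, rgb_tokens, *params["x"])
    # Residual connections, then concatenate channels and mix back to C.
    mixed = np.concatenate([rgb_tokens + rgb_ctx, x_tokens + x_ctx], axis=-1)
    return mixed @ params["w_out"]


# Toy usage with random features standing in for backbone outputs.
rng = np.random.default_rng(0)
N, C = 16, 8  # 16 tokens (e.g. a flattened 4x4 map), 8 channels
rgb = rng.standard_normal((N, C))
x_mod = rng.standard_normal((N, C))

# Rectify each modality using the ORIGINAL features of the other one.
rgb_rect = rectify_channelwise(rgb, x_mod, rng.standard_normal((2 * C, C)))
x_rect = rectify_channelwise(x_mod, rgb, rng.standard_normal((2 * C, C)))

params = {
    "rgb": tuple(rng.standard_normal((C, C)) for _ in range(3)),
    "x": tuple(rng.standard_normal((C, C)) for _ in range(3)),
    "w_out": rng.standard_normal((2 * C, C)),
}
fused = fuse(rgb_rect, x_rect, params)
```

Because every query token attends to every token of the other modality, the fused features carry global rather than purely local context, which is the property the abstract attributes to FFM.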