3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality. However, the task is extremely challenging due to ambiguities arising from the unstructured, sparse, and uncolored nature of 3D point clouds. A possible solution is to combine the 3D information with data from sensors of a different modality, such as RGB cameras. Recent multi-modal 3D semantic segmentation networks exploit these modalities by relying on two branches that process the 2D and 3D information independently, striving to preserve the strengths of each modality. In this work, we first explain why this design choice is effective and then show how it can be improved to make multi-modal semantic segmentation more robust to domain shift. Our surprisingly simple contribution achieves state-of-the-art performance on four popular multi-modal unsupervised domain adaptation benchmarks, as well as better results in a domain generalization scenario.