Using multiple spatial modalities has proven helpful in improving semantic segmentation performance. However, several real-world challenges have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at test time. To address these challenges, we first propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion, which outperforms state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves multi-modal performance but also uses unlabeled data to make the model robust to the realistic missing-modality scenario. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines. Our code is available at https://github.com/harshm121/M3L
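To make the two ideas named above concrete, here is a minimal PyTorch sketch under stated assumptions: "Linear Fusion" is read as a simple learnable convex combination of per-modality feature maps, and M3L's masked modality learning is sketched as a teacher that sees all modalities while the student sees randomly masked ones, trained with a consistency loss on unlabeled data. All class and function names (`LinearFusion`, `masked_consistency_step`, the masking probabilities) are illustrative assumptions, not the authors' API; the exact formulation is in the paper and repository.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearFusion(nn.Module):
    """Fuse per-modality feature maps with learnable scalar weights
    (assumed reading of "Linear Fusion"; a convex combination)."""

    def __init__(self, num_modalities: int = 2):
        super().__init__()
        # One logit per modality; softmax keeps the combination convex.
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, feats):
        # feats: list of [B, C, H, W] tensors, one per modality.
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * f for wi, f in zip(w, feats))


def masked_consistency_step(student, teacher, rgb, depth, p_mask=0.5):
    """One unlabeled-data step: the multi-modal teacher sees both
    modalities, the student may have one modality masked out, and the
    student is trained to match the teacher's prediction.
    Hypothetical loss choice; details differ in the paper."""
    with torch.no_grad():
        teacher_logits = teacher([rgb, depth])  # full multi-modal view
    # Randomly zero out one modality for the student (modality masking).
    if torch.rand(1).item() < p_mask:
        drop_rgb = torch.rand(1).item() < 0.5
        student_inputs = [torch.zeros_like(rgb) if drop_rgb else rgb,
                          depth if drop_rgb else torch.zeros_like(depth)]
    else:
        student_inputs = [rgb, depth]
    student_logits = student(student_inputs)
    # Consistency loss between student and teacher class distributions.
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction="batchmean")
```

Because the student is regularly forced to predict from a single modality, it remains usable when a modality is absent at test time, which is the robustness property the abstract reports.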