Most approaches to semantic segmentation use only information from color cameras to parse the scene, yet recent advances show that depth data can further improve performance. In this work, we focus on transformer-based deep learning architectures, which have achieved state-of-the-art performance on the segmentation task, and we propose to exploit depth information by embedding it in the positional encoding. Effectively, we extend the network to multimodal data without adding any parameters, in a natural way that leverages the strength of the transformer's self-attention modules. We also investigate the idea of performing cross-modality operations inside the attention module, swapping the key inputs between the depth and color branches. Our approach consistently improves performance on the Cityscapes benchmark.
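To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, not the paper's implementation: the module name `CrossModalAttention` and the argument `depth_pos` (a precomputed, depth-derived positional encoding) are illustrative assumptions. It shows (i) adding a depth-based positional term to the tokens of both branches, which introduces no new parameters when the encoding is a fixed mapping of the depth map, and (ii) the cross-modality operation in which each branch keeps its own queries and values but uses the other modality's tokens as keys.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of a dual-branch attention block with depth-aware positional
    encoding and swapped keys between the RGB and depth branches."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.rgb_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb_tokens, depth_tokens, depth_pos):
        # depth_pos: depth map mapped to the token dimension (e.g. by a fixed
        # sinusoidal encoding), used here as the positional term for both branches.
        rgb = rgb_tokens + depth_pos
        dep = depth_tokens + depth_pos
        # Cross-modality operation: queries and values come from the branch's
        # own modality, while the keys come from the other modality.
        rgb_out, _ = self.rgb_attn(query=rgb, key=dep, value=rgb)
        dep_out, _ = self.depth_attn(query=dep, key=rgb, value=dep)
        return rgb_out, dep_out

# Toy usage with random tensors (batch of 2, 196 tokens, 64-dim embeddings).
B, N, D = 2, 196, 64
block = CrossModalAttention(D)
rgb = torch.randn(B, N, D)
dep = torch.randn(B, N, D)
pos = torch.randn(B, N, D)  # stand-in for a depth-derived positional encoding
out_rgb, out_dep = block(rgb, dep, pos)
```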