Most approaches to semantic segmentation use only information from color cameras to parse the scene, yet recent work shows that exploiting depth data can further improve performance. In this work, we focus on transformer-based deep learning architectures, which have achieved state-of-the-art results on the segmentation task, and we propose to exploit depth information by embedding it in the positional encoding. In effect, we extend the network to multimodal data without adding any parameters, in a natural way that leverages the strength of the transformers' self-attention modules. We also investigate performing cross-modality operations inside the attention module by swapping the key inputs between the depth and color branches. Our approach consistently improves performance on the Cityscapes benchmark.
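The two ideas above can be sketched in a minimal NumPy toy example: per-token depth values are turned into a sinusoidal positional encoding and added to the color tokens (no extra parameters), and the attention keys are swapped between the color and depth branches. All shapes, variable names, and data here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention (single head, no masking).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


rng = np.random.default_rng(0)
n, d = 6, 8  # number of tokens, embedding dimension (toy sizes)

# Hypothetical token embeddings from the color and depth branches.
color_tokens = rng.normal(size=(n, d))
depth_tokens = rng.normal(size=(n, d))

# Depth-aware positional encoding: encode each token's depth value
# sinusoidally, in place of the usual index-based encoding. This adds
# the depth modality without introducing any learned parameters.
depth_values = rng.uniform(0.5, 10.0, size=n)  # e.g. metric depth per patch
freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
pe = np.zeros((n, d))
pe[:, 0::2] = np.sin(depth_values[:, None] * freqs)
pe[:, 1::2] = np.cos(depth_values[:, None] * freqs)
color_in = color_tokens + pe

# Random stand-ins for the learned query/key/value projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
qc, kc, vc = color_in @ Wq, color_in @ Wk, color_in @ Wv
qd, kd, vd = depth_tokens @ Wq, depth_tokens @ Wk, depth_tokens @ Wv

# Cross-modality attention: swap the keys between the two branches, so
# each branch's queries are matched against the other modality's keys.
color_out = attention(qc, kd, vc)  # color queries attend via depth keys
depth_out = attention(qd, kc, vd)  # depth queries attend via color keys
```

The key swap changes only which modality drives the attention weights; the values (and hence the output features) still come from each branch's own modality.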