Most existing RGB-D salient object detection (SOD) methods follow the CNN-based paradigm, which cannot model long-range dependencies across space and modalities due to the inherent locality of CNNs. Here we propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem. Unlike previous multi-modal transformers that directly connect all patches from two modalities, we explore cross-modal complementarity hierarchically to respect the modality gap and the spatial discrepancy in unaligned regions. Specifically, we propose to use intra-modal self-attention to explore complementary global contexts, and to measure spatially aligned inter-modal attention locally to capture cross-modal correlations. In addition, we present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration, as well as a consistency-complementarity module to disentangle the multi-modal integration path and improve the fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our designs and the consistent improvement over state-of-the-art models.
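As a minimal sketch of the hierarchical attention scheme described above, the PyTorch module below applies global self-attention within each modality and then window-local cross-modal attention between spatially aligned RGB and depth tokens. All names and hyperparameters (`HierarchicalCrossModalBlock`, `window_size`, the use of `nn.MultiheadAttention`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: intra-modal global self-attention + window-local inter-modal attention.
import torch
import torch.nn as nn


class HierarchicalCrossModalBlock(nn.Module):
    def __init__(self, dim=64, heads=4, window_size=4):
        super().__init__()
        self.window_size = window_size
        # intra-modal global self-attention, one per modality
        self.rgb_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dep_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # inter-modal attention applied only within spatially aligned local windows
        self.rgb_from_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dep_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _windows(self, x, H, W):
        # (B, H*W, C) -> (B * num_windows, ws*ws, C)
        B, _, C = x.shape
        ws = self.window_size
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

    def _unwindows(self, x, B, H, W):
        # inverse of _windows: (B * num_windows, ws*ws, C) -> (B, H*W, C)
        ws = self.window_size
        C = x.shape[-1]
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H * W, C)

    def forward(self, rgb, dep, H, W):
        # rgb, dep: (B, H*W, C) token sequences from the two modalities
        B = rgb.shape[0]
        # 1) intra-modal self-attention: complementary global context per modality
        rgb = rgb + self.rgb_self(rgb, rgb, rgb)[0]
        dep = dep + self.dep_self(dep, dep, dep)[0]
        # 2) inter-modal attention restricted to spatially aligned local windows
        rgb_w, dep_w = self._windows(rgb, H, W), self._windows(dep, H, W)
        rgb_x = self.rgb_from_dep(rgb_w, dep_w, dep_w)[0]
        dep_x = self.dep_from_rgb(dep_w, rgb_w, rgb_w)[0]
        rgb = rgb + self._unwindows(rgb_x, B, H, W)
        dep = dep + self._unwindows(dep_x, B, H, W)
        return rgb, dep


if __name__ == "__main__":
    block = HierarchicalCrossModalBlock(dim=64, heads=4, window_size=4)
    rgb = torch.randn(2, 16 * 16, 64)   # RGB tokens on a 16x16 grid
    dep = torch.randn(2, 16 * 16, 64)   # depth tokens on the same grid
    out_rgb, out_dep = block(rgb, dep, H=16, W=16)
    print(out_rgb.shape, out_dep.shape)  # torch.Size([2, 256, 64]) each
```

The key design choice mirrored here is that cross-modal attention is computed only among tokens sharing the same local window, which limits the influence of spatially misaligned regions while the preceding intra-modal stage still provides global context.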