Salient Object Detection (SOD) is the task of predicting the regions of a given scene that attract human attention. Fusing depth information has proven effective for this task. The main challenge is how to aggregate the complementary information from the RGB modality and the depth modality. However, conventional deep models rely heavily on CNN feature extractors, and long-range contextual dependencies are usually ignored. In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network (DTMINet). We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in the visual inputs. Before fusing the two branches of features into one, attention-based modules are applied to enhance the features from each modality. We design a self-attention-based cross-modality interaction module and a gated modality attention module to exploit the complementary information between the two modalities. For saliency decoding, we build densely connected decoding stages that maintain a decoding memory while the multi-level encoder features are considered simultaneously. To mitigate the effect of inaccurate depth maps, we gather the early-stage RGB features in a skip convolution module so that the RGB modality provides additional guidance to the final saliency prediction. In addition, we add edge supervision to regularize the feature learning process. Comprehensive experiments on five standard RGB-D SOD benchmark datasets with four evaluation metrics demonstrate the superiority of the proposed DTMINet.
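To make the two fusion components more concrete, the sketch below illustrates, under assumptions, how a self-attention-based cross-modality interaction step and a gated modality attention step could be wired together with standard PyTorch layers. The module names `CrossModalityInteraction` and `GatedModalityAttention`, the token shapes, and all hyperparameters are hypothetical stand-ins for the modules named in the abstract, not the authors' released implementation; the Swin-Transformer backbones are replaced here by random token tensors for brevity.

```python
# Illustrative sketch only: cross-modality interaction via multi-head
# cross-attention, followed by a gated fusion of the two feature streams.
import torch
import torch.nn as nn


class CrossModalityInteraction(nn.Module):
    """Each modality queries the other with multi-head attention (assumed design)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        # Cross-attend in both directions and keep residual paths.
        r, _ = self.rgb_from_depth(rgb_tokens, depth_tokens, depth_tokens)
        d, _ = self.depth_from_rgb(depth_tokens, rgb_tokens, rgb_tokens)
        return self.norm_rgb(rgb_tokens + r), self.norm_depth(depth_tokens + d)


class GatedModalityAttention(nn.Module):
    """Learned gate that weighs the RGB and depth streams per token (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb_tokens, depth_tokens):
        g = self.gate(torch.cat([rgb_tokens, depth_tokens], dim=-1))
        return g * rgb_tokens + (1.0 - g) * depth_tokens


if __name__ == "__main__":
    B, N, C = 2, 196, 96              # batch, tokens (14x14 patches), channels
    rgb = torch.randn(B, N, C)        # stand-in for one RGB-branch encoder stage
    depth = torch.randn(B, N, C)      # stand-in for one depth-branch encoder stage
    rgb_e, depth_e = CrossModalityInteraction(C)(rgb, depth)
    fused = GatedModalityAttention(C)(rgb_e, depth_e)
    print(fused.shape)                # torch.Size([2, 196, 96])
```

In this reading, the cross-attention step enhances each modality with context from the other before fusion, and the sigmoid gate then decides, per channel and token, how much each enhanced stream contributes to the fused feature passed to the decoder.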