Existing CNN-based RGB-D salient object detection (SOD) networks all require pretraining on ImageNet to learn the hierarchical features that provide a good initialization. However, collecting and annotating large-scale datasets is time-consuming and expensive. In this paper, we utilize self-supervised representation learning (SSL) to design two pretext tasks: the cross-modal auto-encoder and depth-contour estimation. Our pretext tasks require only a small amount of unlabeled RGB-D data for pretraining, which enables the network to capture rich semantic contexts and reduces the gap between the two modalities, thereby providing an effective initialization for the downstream task. In addition, to address the inherent problem of cross-modal fusion in RGB-D SOD, we propose a consistency-difference aggregation (CDA) module that splits a single feature fusion into multi-path fusion to achieve an adequate perception of consistent and differential information. The CDA module is general and suitable for both cross-modal and cross-level feature fusion. Extensive experiments on six benchmark datasets show that our self-supervised pretrained model performs favorably against most state-of-the-art methods pretrained on ImageNet. The source code will be publicly available at \textcolor{red}{\url{https://github.com/Xiaoqi-Zhao-DLUT/SSLSOD}}.
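To make the pretext-task setup concrete, the following is a minimal sketch, not the paper's implementation: the abstract specifies only the task names, so the L1 reconstruction loss, the Sobel-gradient pseudo contours, the binarization threshold, and the toy stand-in encoders below are all illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

def sobel_contours(depth: torch.Tensor) -> torch.Tensor:
    """Pseudo contour labels from depth gradients (an assumed recipe;
    the point is that no manual annotation is needed)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

class PretextModel(nn.Module):
    """Toy stand-in network: one encoder per modality, one head per task."""
    def __init__(self):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.dep_enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.rgb2dep = nn.Conv2d(16, 1, 3, padding=1)  # cross-modal auto-encoder head
        self.contour = nn.Conv2d(16, 1, 3, padding=1)  # depth-contour head

    def pretext_loss(self, rgb, depth):
        # Task 1: reconstruct depth from RGB (cross-modal auto-encoding).
        recon = self.rgb2dep(self.rgb_enc(rgb))
        loss_ae = F.l1_loss(recon, depth)
        # Task 2: estimate contours from depth features.
        pred = torch.sigmoid(self.contour(self.dep_enc(depth)))
        target = (sobel_contours(depth) > 0.1).float()  # assumed threshold
        loss_ct = F.binary_cross_entropy(pred, target)
        return loss_ae + loss_ct

if __name__ == "__main__":
    model = PretextModel()
    rgb = torch.rand(2, 3, 64, 64)
    depth = torch.rand(2, 1, 64, 64)
    print(model.pretext_loss(rgb, depth).item())  # unlabeled RGB-D pairs only
\end{verbatim}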
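Likewise, a hedged sketch of what a consistency-difference aggregation module might look like: here consistency is modeled by element-wise multiplication and difference by bidirectional subtraction, each refined on its own path before aggregation. These specific operations are assumptions; the abstract states only that a single fusion is split into multi-path fusion over consistent and differential information.

\begin{verbatim}
import torch
import torch.nn as nn

class CDA(nn.Module):
    """Hypothetical consistency-difference aggregation (CDA) sketch.

    Works for cross-modal (RGB vs. depth) or cross-level feature pairs,
    assuming both inputs share the same shape (B, C, H, W).
    """
    def __init__(self, channels: int):
        super().__init__()
        def conv_block():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.consistency = conv_block()  # path for shared (consistent) cues
        self.diff_a = conv_block()       # path for cues unique to input a
        self.diff_b = conv_block()       # path for cues unique to input b
        self.fuse = conv_block()         # merges the multi-path outputs

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Consistent information: responses that are high in both inputs.
        cons = self.consistency(feat_a * feat_b)
        # Differential information: what each input has that the other lacks.
        diff_a = self.diff_a(feat_a - feat_b)
        diff_b = self.diff_b(feat_b - feat_a)
        # Aggregate the three paths instead of one naive fusion step.
        return self.fuse(cons + diff_a + diff_b)

if __name__ == "__main__":
    cda = CDA(channels=64)
    rgb_feat = torch.randn(2, 64, 32, 32)
    depth_feat = torch.randn(2, 64, 32, 32)
    print(cda(rgb_feat, depth_feat).shape)  # torch.Size([2, 64, 32, 32])
\end{verbatim}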