Fully supervised salient object detection (SOD) methods have made considerable progress in performance, yet these models rely heavily on expensive pixel-wise labels. Recently, to achieve a trade-off between labeling burden and performance, scribble-based SOD methods have attracted increasing attention. Previous models perform the SOD task directly on small-scale SOD training data. Because weak scribble annotations and such small-scale training data provide only limited information, it is extremely difficult for these models to understand the image and thus to achieve strong SOD performance. In this paper, we propose a simple yet effective framework for scribble-based SOD guided by general visual representations that simulate the general visual cognition of humans. It consists of a task-related encoder, a general visual module, and an information integration module that efficiently combines the general visual representations learned from large-scale unlabeled datasets with task-related features, so that the SOD task is performed on the basis of understanding the contextual relations within images. Meanwhile, we propose a novel global semantic affinity loss that guides the model to perceive the global structure of the salient objects. Experimental results on five public benchmark datasets demonstrate that our method, which uses only scribble annotations without introducing any extra labels, outperforms state-of-the-art weakly supervised SOD methods and is comparable or even superior to state-of-the-art fully supervised models.
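To make the three-module design concrete, the following is a minimal sketch of how a task-related encoder, a frozen general visual module, and an information integration module could be wired together. The backbone choices (a ResNet-50 task encoder, a frozen self-supervised transformer as the general visual module), feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the framework: task-related features are fused with
# general visual representations from a frozen, self-supervised module.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class TaskRelatedEncoder(nn.Module):
    """Extracts task-related features from the input image (assumed ResNet-50 backbone)."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, out_dim, kernel_size=1)

    def forward(self, x):
        return self.proj(self.stem(x))  # (B, out_dim, H/32, W/32)


class InformationIntegrationModule(nn.Module):
    """Fuses task-related features with general visual representations (simple concat + conv)."""
    def __init__(self, task_dim=256, general_dim=768, out_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(task_dim + general_dim, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(out_dim, 1, kernel_size=1)  # saliency prediction head

    def forward(self, task_feat, general_feat):
        # Align spatial sizes before fusion.
        general_feat = F.interpolate(general_feat, size=task_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([task_feat, general_feat], dim=1))
        return self.head(fused)  # saliency logits


class ScribbleSODFramework(nn.Module):
    def __init__(self, general_visual_module, general_dim=768):
        super().__init__()
        self.encoder = TaskRelatedEncoder()
        # General visual module: pre-trained on large-scale unlabeled data, kept frozen.
        self.general = general_visual_module.eval()
        for p in self.general.parameters():
            p.requires_grad = False
        self.integrate = InformationIntegrationModule(general_dim=general_dim)

    def forward(self, x):
        task_feat = self.encoder(x)
        with torch.no_grad():
            # Assumed to return a (B, general_dim, h, w) feature map.
            general_feat = self.general(x)
        return self.integrate(task_feat, general_feat)
```

In this sketch the general visual module supplies frozen, context-rich features while only the task-related encoder and the integration module are trained against the scribble supervision; the global semantic affinity loss described above would be applied to the predicted saliency map alongside the scribble-based loss.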