The human visual system (HVS) is believed to perform salient object detection (SOD) through a pre-attentive process followed by an attention process. Motivated by this view, we propose a four-stage framework for SOD, named \textbf{PAANet}: the first two stages, general feature extraction (GFE) and feature preprocessing (FP), correspond to the \textbf{P}re-\textbf{A}ttentive process, and the last two stages, saliency feature extraction (SFE) and feature aggregation (FA), correspond to the \textbf{A}ttention process. In keeping with the pre-attentive process, the GFE stage uses a fully trained backbone that needs no further fine-tuning on different datasets, which greatly increases training speed. The FP stage takes over the role of fine-tuning but works more efficiently because it has a simpler structure and fewer parameters. In the SFE stage, we design a novel contrast operator for saliency feature extraction; compared with the traditional convolution operator, it works more semantically when extracting the interaction information between the foreground and its surroundings. This contrast operator can also be cascaded to form a deeper structure that extracts higher-order saliency, which is more effective for complex scenes. Comparative experiments against state-of-the-art methods on five datasets demonstrate the effectiveness of our framework.