The fully convolutional network (FCN) has long dominated salient object detection. However, the locality of CNNs requires a model to be deep enough to obtain a global receptive field, and such a deep model inevitably loses local details. In this paper, we introduce an attention-based encoder, the vision transformer, into salient object detection to ensure globally informed representations from shallow to deep layers. With a global view even in very shallow layers, the transformer encoder preserves more local representations for recovering spatial details in the final saliency maps. Besides, as each layer captures a global view of its previous layer, adjacent layers implicitly maximize the representation differences and minimize redundant features, so that every output feature of the transformer layers contributes uniquely to the final prediction. To decode the transformer features, we propose a simple yet effective deeply-transformed decoder, which densely decodes and upsamples these features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average improvement of 12.17% in terms of Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.
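The following is a minimal, illustrative PyTorch sketch of the overall idea described above, not the authors' released implementation: a ViT-style encoder whose per-layer token features are all decoded and progressively upsampled into a single-channel saliency map. The class name, dimensions, layer count, and the summation-based fusion are simplified assumptions for illustration only.

```python
# Toy sketch (assumed, not the GLSTR code): global self-attention at every
# encoder depth, with all layer outputs densely fused and upsampled to a
# full-resolution saliency map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySaliencyTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, num_heads=8):
        super().__init__()
        self.grid = img_size // patch_size                       # tokens per side
        self.embed = nn.Conv2d(3, dim, patch_size, patch_size)   # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        # Keep every layer separate so the decoder can fuse all of their outputs.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        # One small conv head per transformer layer; fused by summation here.
        self.side_heads = nn.ModuleList(
            nn.Conv2d(dim, 64, 3, padding=1) for _ in range(depth)
        )
        self.predict = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        fused = 0
        for layer, head in zip(self.layers, self.side_heads):
            tokens = layer(tokens)                               # global attention at every depth
            feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
            fused = fused + head(feat)                           # dense fusion of all layer outputs
        fused = F.interpolate(fused, size=(h, w), mode="bilinear", align_corners=False)
        return torch.sigmoid(self.predict(fused))                # per-pixel saliency in [0, 1]


if __name__ == "__main__":
    model = ToySaliencyTransformer()
    saliency = model(torch.randn(2, 3, 224, 224))
    print(saliency.shape)  # torch.Size([2, 1, 224, 224])
```

In this sketch every encoder layer already attends globally, so even the earliest fused features carry scene-level context, which is the property the abstract contrasts with deep FCN encoders.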