Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in the Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to obtain high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.
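To make the token-based multi-task decoder concrete, below is a minimal sketch of how a single task-related token (e.g., a saliency or boundary token) could interact with patch tokens through a patch-task-attention step to yield dense per-patch predictions. The module name, embedding dimension, and the sigmoid gating over the single task token are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class PatchTaskAttention(nn.Module):
    """Sketch of patch-task attention: patch tokens (queries) attend to one
    task token (key/value) to produce task-aware per-patch prediction logits.
    Dimensions and the sigmoid gate are illustrative choices."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from patch tokens
        self.k = nn.Linear(dim, dim)   # key from the task token
        self.v = nn.Linear(dim, dim)   # value from the task token
        self.proj = nn.Linear(dim, 1)  # per-patch prediction logit
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor, task_token: torch.Tensor):
        # patch_tokens: (B, N, C); task_token: (B, 1, C)
        q = self.q(patch_tokens)                         # (B, N, C)
        k = self.k(task_token)                           # (B, 1, C)
        v = self.v(task_token)                           # (B, 1, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (B, N, 1)
        attn = attn.sigmoid()                            # single key -> per-patch gate
        task_aware = patch_tokens + attn * v             # fuse task info into patches
        return self.proj(task_aware)                     # (B, N, 1) dense logits


# Usage example with assumed shapes: 14x14 = 196 patch tokens, one saliency token.
if __name__ == "__main__":
    block = PatchTaskAttention(dim=384)
    patches = torch.randn(2, 196, 384)
    saliency_token = torch.randn(2, 1, 384)
    logits = block(patches, saliency_token)  # (2, 196, 1), reshaped to a 14x14 map downstream
    print(logits.shape)
```

In such a design, separate task tokens (saliency and boundary) would share the patch tokens but produce their own prediction maps, which is one way to realize joint saliency and boundary detection in a single decoder.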