Recently, numerous saliency detection methods have achieved promising results by relying on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Apart from the traditional transformer architecture used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to obtain high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing state-of-the-art methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models.
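To make the decoder design concrete, the following is a minimal, hypothetical sketch of a patch-task-attention style module in PyTorch. It is not the authors' implementation: the class name `PatchTaskAttention`, the single-head formulation, and the tensor shapes are illustrative assumptions. It only shows the core idea that patch tokens attend to a small set of task-related tokens (e.g., a saliency token and a boundary token) so that per-patch features become task-aware before dense prediction.

```python
import torch
import torch.nn as nn


class PatchTaskAttention(nn.Module):
    """Illustrative (hypothetical) patch-task-attention: patch tokens query
    the task-related tokens, yielding task-conditioned patch features."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from patch tokens
        self.k = nn.Linear(dim, dim)   # keys from task tokens
        self.v = nn.Linear(dim, dim)   # values from task tokens
        self.scale = dim ** -0.5

    def forward(self, patch_tokens, task_tokens):
        # patch_tokens: (B, N, C) image patch embeddings
        # task_tokens:  (B, T, C) task-related tokens, e.g. T=2 for saliency/boundary
        q = self.q(patch_tokens)                        # (B, N, C)
        k = self.k(task_tokens)                         # (B, T, C)
        v = self.v(task_tokens)                         # (B, T, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, T)
        attn = attn.softmax(dim=-1)
        return attn @ v                                 # (B, N, C) task-aware patch features


# Usage sketch: 14x14 grid of patch tokens with two task tokens.
x = torch.randn(1, 196, 384)
t = torch.randn(1, 2, 384)
out = PatchTaskAttention(384)(x, t)   # -> torch.Size([1, 196, 384])
```

In the full model these task-aware patch features would then be upsampled (via the proposed token upsampling) and projected to per-pixel saliency and boundary maps; that part is omitted here.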