Saliency prediction aims to predict the distribution of human visual attention over a given RGB image. Most recent state-of-the-art methods are based on deep image feature representations from traditional CNNs. However, traditional convolution cannot capture the global features of an image well due to its small kernel size. Moreover, the high-level factors that closely correlate with human visual perception, e.g., objects, color, and light, are not considered. Motivated by these observations, we propose a Transformer-based method with semantic segmentation as an auxiliary learning objective. The Transformer captures more global cues of the image, and simultaneously learning object segmentation simulates human visual perception, which we verify in our investigation of human gaze control in cognitive science. We build an extra decoder for the subtask, and the multiple tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. We find in practice that simply adding the subtask may confuse the learning of the main task, so we propose a Multi-task Attention Module to handle the feature interaction between the multiple learning targets. Our method achieves competitive performance compared to other state-of-the-art methods.
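The shared-encoder, multi-decoder layout described above can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's actual implementation: all layer names, dimensions, and the sigmoid gating used to stand in for the Multi-task Attention Module are assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)


class SharedEncoderMultiTask:
    """Toy sketch of a shared encoder with two task decoders.

    A single encoder produces shared features; a saliency decoder and a
    segmentation decoder each read from them. A learned gate (standing in
    for the paper's Multi-task Attention Module, details assumed) modulates
    which shared features flow to the saliency head.
    """

    def __init__(self, d_in=16, d_hid=32, n_seg_classes=4):
        scale = 0.1
        # shared encoder weights (a real model would use a Transformer here)
        self.W_enc = rng.normal(size=(d_in, d_hid)) * scale
        self.b_enc = np.zeros(d_hid)
        # task-specific decoder heads
        self.W_sal = rng.normal(size=(d_hid, 1)) * scale            # saliency map
        self.W_seg = rng.normal(size=(d_hid, n_seg_classes)) * scale  # segmentation
        # gating weights for cross-task feature interaction (assumed form)
        self.W_gate = rng.normal(size=(d_hid, d_hid)) * scale

    def forward(self, x):
        h = np.tanh(x @ self.W_enc + self.b_enc)       # shared features
        gate = 1.0 / (1.0 + np.exp(-(h @ self.W_gate)))  # sigmoid attention gate
        sal = (h * gate) @ self.W_sal                  # gated saliency prediction
        seg = h @ self.W_seg                           # segmentation logits
        return sal, seg


# usage: a batch of 8 flattened feature vectors
model = SharedEncoderMultiTask()
sal, seg = model.forward(rng.normal(size=(8, 16)))
```

Because both heads backpropagate through the same encoder during training, the shared features are forced to serve both objectives, while the gate lets the saliency head down-weight features that mainly serve segmentation.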