3D convolutional neural networks have achieved promising results on video tasks in computer vision, including the video saliency prediction task explored in this paper. However, 3D convolution encodes visual representations only over a fixed local spatio-temporal window determined by its kernel size, whereas human attention is often drawn to relational visual features across different time steps. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are inserted at different levels of a 3D convolutional backbone to directly capture long-range relations between spatio-temporal features at different time steps. In addition, we propose an Attentional Multi-Scale Fusion (AMSF) module that integrates multi-level features while perceiving context in semantic and spatio-temporal subspaces. Extensive experiments demonstrate the contributions of the key components of our method, and the results on the DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets clearly show the superiority of the proposed model over state-of-the-art models.
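To make the core idea concrete, the following is a minimal PyTorch sketch of self-attention applied over flattened spatio-temporal tokens from one level of a 3D backbone. The 1x1x1 projection sizes, the single-head formulation, and the learnable residual gate `gamma` are illustrative assumptions and do not reproduce the authors' exact STSA or AMSF designs.

```python
import torch
import torch.nn as nn

class SpatioTemporalSelfAttention(nn.Module):
    """Sketch of self-attention over flattened spatio-temporal tokens.

    Hypothetical illustration of the STSA idea: every (time, height, width)
    position attends to every other, adding long-range context to the
    local features produced by 3D convolution.
    """

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.query = nn.Conv3d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv3d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) from a 3D conv backbone level
        b, c, t, h, w = x.shape
        n = t * h * w  # number of spatio-temporal tokens

        q = self.query(x).view(b, -1, n).permute(0, 2, 1)  # (b, n, c//8)
        k = self.key(x).view(b, -1, n)                      # (b, c//8, n)
        v = self.value(x).view(b, -1, n)                    # (b, c, n)

        # Pairwise relations between all spatio-temporal positions
        attn = torch.softmax(torch.bmm(q, k) / (q.size(-1) ** 0.5), dim=-1)  # (b, n, n)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, t, h, w)
        return x + self.gamma * out  # long-range context added to local features


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)  # example feature map from one backbone level
    print(SpatioTemporalSelfAttention(64)(feats).shape)  # torch.Size([2, 64, 8, 14, 14])
```

In the full model, such a module would be applied at several backbone levels, with the resulting multi-level features fused by an AMSF-style module before decoding the saliency map.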