Video salient object detection aims to find the most visually distinctive objects in a video. To model temporal dependencies, existing methods usually resort to recurrent neural networks or optical flow. However, these approaches incur high computational cost and tend to accumulate inaccuracies over time. In this paper, we propose a network with attention modules that learns contrastive features for video salient object detection without these computationally expensive temporal modeling techniques. We develop a non-local self-attention scheme to capture global information within each video frame. A co-attention formulation combines low-level and high-level features. We further apply contrastive learning to improve the feature representations: foreground region pairs from the same video are pulled together, while foreground-background region pairs are pushed apart in the latent space. The intra-frame contrastive loss helps separate foreground and background features, and the inter-frame contrastive loss improves temporal consistency. We conduct extensive experiments on several benchmark datasets for video salient object detection and unsupervised video object segmentation, and show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
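To make the non-local self-attention scheme concrete, below is a minimal PyTorch sketch of a standard non-local block in the spirit of Wang et al.'s non-local networks. The layer shapes, the halved intermediate channel width, and the residual form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    """Non-local self-attention over all spatial positions of one frame.

    Channel sizes and the residual connection are assumptions for this sketch.
    """

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        # Every position attends to every other position in the frame,
        # which is what gives the block its global receptive field.
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual output
```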
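The intra- and inter-frame contrastive losses can be illustrated with the following hedged PyTorch sketch. The InfoNCE-style form, the pooled foreground/background region embeddings (`fg_a`, `fg_b`, `bg`, `fg_next`, `bg_next` are hypothetical names), and the temperature value are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Pull `anchor` toward `positive`; push it away from each row of `negatives`.

    anchor, positive: (D,) region embeddings; negatives: (N, D).
    Shapes and temperature are illustrative assumptions.
    """
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)

    pos = torch.dot(anchor, positive) / temperature           # scalar similarity
    neg = negatives @ anchor / temperature                    # (N,) similarities
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)  # (1, 1+N)
    target = torch.zeros(1, dtype=torch.long)                 # positive is class 0
    return F.cross_entropy(logits, target)


def video_contrastive_losses(fg_a, fg_b, bg, fg_next, bg_next):
    """Hypothetical pooled embeddings: fg_a/fg_b are two foreground samples
    of frame t, bg its background; fg_next/bg_next come from frame t+1."""
    # Intra-frame: separate foreground from background within frame t.
    intra = info_nce(fg_a, fg_b, bg.unsqueeze(0))
    # Inter-frame: keep foreground features consistent across frames,
    # pushing away both frames' backgrounds.
    inter = info_nce(fg_a, fg_next, torch.stack([bg, bg_next]))
    return intra + inter


if __name__ == "__main__":
    d = 128
    embeddings = [torch.randn(d) for _ in range(5)]
    print(video_contrastive_losses(*embeddings))
```

Normalizing embeddings and dividing by a temperature follows common InfoNCE practice; in the actual method, the region embeddings would presumably be pooled from saliency-masked feature maps rather than sampled at random as in this demo.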