Depth can provide useful spatial cues for salient object detection (SOD), and has proven helpful in recent RGB-D SOD methods. However, existing video salient object detection (VSOD) methods exploit only spatiotemporal information and seldom use depth for detection. In this paper, we propose a depth-cooperated trimodal network, called DCTNet, for VSOD, a pioneering work that incorporates depth information to assist VSOD. To this end, we first generate depth from RGB frames, and then propose an approach that treats the three modalities unequally. Specifically, a multi-modal attention module (MAM) is designed to model multi-modal long-range dependencies between the main modality (RGB) and the two auxiliary modalities (depth and optical flow). We also introduce a refinement fusion module (RFM) to suppress noise in each modality and dynamically select useful information for further feature refinement. Lastly, a progressive fusion strategy is applied to the refined features to achieve the final cross-modal fusion. Experiments on five benchmark datasets demonstrate the superiority of our depth-cooperated model over 12 state-of-the-art methods, and the necessity of depth is also validated.
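The abstract does not spell out how the multi-modal attention module (MAM) relates the main modality (RGB) to an auxiliary modality (depth or optical flow). The following is a minimal PyTorch sketch of one plausible realization, a cross-modal non-local attention block in which RGB features query auxiliary features to capture long-range dependencies; all class, parameter, and shape choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cross-modal attention in which the
# main modality (RGB) queries an auxiliary modality (depth or optical flow).
# All names, channel sizes, and shapes are assumptions for illustration.
import torch
import torch.nn as nn


class MultiModalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inter = channels // reduction
        self.query = nn.Conv2d(channels, inter, kernel_size=1)  # projects RGB features
        self.key = nn.Conv2d(channels, inter, kernel_size=1)    # projects auxiliary features
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb_feat.shape
        q = self.query(rgb_feat).flatten(2).transpose(1, 2)    # (B, HW, C')
        k = self.key(aux_feat).flatten(2)                       # (B, C', HW)
        v = self.value(aux_feat).flatten(2).transpose(1, 2)     # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return rgb_feat + self.out(fused)  # residual: auxiliary cues refine RGB features


# Usage example: RGB features attend over depth features at one encoder stage.
if __name__ == "__main__":
    mam = MultiModalAttention(channels=64)
    rgb = torch.randn(1, 64, 44, 44)
    depth = torch.randn(1, 64, 44, 44)
    print(mam(rgb, depth).shape)  # torch.Size([1, 64, 44, 44])
```

In this reading, the same block could be instantiated twice per stage (once for depth, once for optical flow) so that both auxiliary modalities refine the RGB stream before the refinement fusion module (RFM) and the progressive cross-modal fusion described in the abstract.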