Video salient object detection (VSOD) is an important task in many vision applications. Reliable VSOD requires simultaneously exploiting information from both the spatial and the temporal domain. Most existing algorithms merely adopt simple fusion strategies, such as addition and concatenation, to merge information from the two domains. Despite their simplicity, such fusion strategies may introduce feature redundancy and fail to fully exploit the relationships among the multi-level features extracted from both spatial and temporal domains. In this paper, we propose an adaptive local-global refinement framework for VSOD. Different from previous approaches, we design a local refinement architecture and a global one to refine the simply fused features at different scopes, which fully explores both the local and the global dependence of multi-level features. In addition, to emphasize effective information and suppress useless information, an adaptive weighting mechanism is designed based on a graph convolutional network (GCN). We show that this weighting methodology further exploits feature correlations, driving the network to learn more discriminative feature representations. Extensive experimental results on public video datasets demonstrate the superiority of our method over existing ones.
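To make the contrast concrete, the sketch below illustrates the two ideas from the abstract in minimal NumPy form: a baseline "simple fusion" by element-wise addition, and a hypothetical one-step graph convolution that produces adaptive per-level weights over multi-level features. All function names, the toy graph, and the single-layer design are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def simple_fusion(spatial, temporal):
    # Baseline strategy the abstract argues against: plain element-wise
    # addition of spatial and temporal features.
    return spatial + temporal

def gcn_adaptive_weights(features, adj):
    # Hypothetical sketch of GCN-based adaptive weighting: one
    # message-passing step over a graph whose nodes are feature levels,
    # followed by a softmax to obtain one weight per level.
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / deg                 # row-normalized adjacency
    hidden = norm_adj @ features         # graph convolution (message passing)
    scores = hidden.mean(axis=1)         # scalar score per feature level
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights

# Toy example: 4 feature levels, 8-dim features, dense graph with self-loops.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
adj = np.ones((4, 4)) + np.eye(4)
w = gcn_adaptive_weights(feats, adj)
refined = (w[:, None] * feats).sum(axis=0)  # adaptively weighted aggregation
```

The softmax guarantees the level weights are positive and sum to one, so the aggregation emphasizes informative levels while suppressing the rest, mirroring the "emphasize effective information and suppress useless information" goal described above.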