Guidance and Teaching Network for Video Salient Object Detection

Owing to the difficulty of mining spatial-temporal cues, existing approaches to video salient object detection (VSOD) are limited in their understanding of complex and noisy scenarios and often fail to infer the prominent objects. To alleviate these shortcomings, we propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet), which independently distils effective spatial and temporal cues with implicit guidance at the feature level and explicit teaching at the decision level. Specifically, we (a) introduce a temporal modulator that implicitly bridges features from the motion branch into the appearance branch and fuses cross-modal features collaboratively, and (b) utilise a motion-guided mask to propagate explicit cues during feature aggregation. This learning strategy achieves satisfactory results by decoupling the complex spatial-temporal cues and mapping informative cues across different modalities. Extensive experiments on three challenging benchmarks show that the proposed method runs at ~28 fps on a single TITAN Xp GPU and performs competitively against 14 cutting-edge baselines.
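As a rough illustration of the two mechanisms named above, the following sketch contrasts feature-level modulation with decision-level masking on toy feature vectors. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the function names `temporal_modulate` and `mask_guided_aggregate`, the residual sigmoid gate, and the element-wise mask weighting are hypothetical and are not GTNet's published operations.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def temporal_modulate(appearance, motion):
    """Implicit, feature-level guidance (hypothetical form).

    Each appearance feature is scaled by a gate derived from the
    corresponding motion feature. The residual form a * (1 + gate)
    keeps the appearance signal even when the motion cue is weak.
    """
    return [a * (1.0 + sigmoid(m)) for a, m in zip(appearance, motion)]


def mask_guided_aggregate(features, motion_mask):
    """Explicit, decision-level teaching (hypothetical form).

    A motion-derived mask with values in [0, 1] weights the fused
    features element-wise during aggregation.
    """
    return [f * w for f, w in zip(features, motion_mask)]


# Toy inputs: three feature positions from each branch.
appearance = [0.5, -1.0, 2.0]
motion = [0.0, 3.0, -3.0]
mask = [1.0, 0.0, 0.5]

fused = temporal_modulate(appearance, motion)
out = mask_guided_aggregate(fused, mask)
```

In this toy setup, a position whose mask value is 0 is suppressed regardless of its fused feature, while the gate only rescales features and never removes them. That mirrors the distinction the abstract draws between implicit guidance and explicit teaching, under the assumptions stated above.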
