Fully-supervised video salient object detection has improved significantly, but it relies on pixel-wise labeled training datasets, which are time-consuming and expensive to obtain. To relieve the burden of data annotation, we present the first weakly supervised video salient object detection model, based on relabeled "fixation guided scribble annotations". Specifically, we propose an "Appearance-motion fusion module" and a bidirectional ConvLSTM based framework to achieve effective multi-modal learning and long-term temporal context modeling from our new weak annotations. We further design a novel foreground-background similarity loss to exploit the labeling similarity across frames. We also introduce a weak annotation boosting strategy that improves model performance with a new pseudo-label generation technique. Extensive experimental results on six benchmark video saliency detection datasets illustrate the effectiveness of our solution.
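To illustrate what supervision from scribble annotations can look like in practice, the following is a minimal sketch of a partial binary cross-entropy, a loss commonly used in scribble-supervised segmentation. It is not taken from this work: the abstract does not state which loss is applied to the scribble-labeled pixels, and the function name `partial_bce` and the label encoding (1 = foreground, 0 = background, -1 = unlabeled) are assumptions made for illustration only. The sketch covers neither the foreground-background similarity loss nor the pseudo-label generation technique.

```python
import math


def partial_bce(pred, scribble, eps=1e-7):
    """Mean binary cross-entropy over scribble-labeled pixels only.

    pred:     2D list of predicted saliency probabilities in [0, 1].
    scribble: 2D list of labels; 1 = foreground, 0 = background,
              -1 = unlabeled (hypothetical encoding, not from the paper).
    """
    total, count = 0.0, 0
    for pred_row, label_row in zip(pred, scribble):
        for p, y in zip(pred_row, label_row):
            if y < 0:
                # Unlabeled pixels contribute nothing to the loss.
                continue
            # Clamp to avoid log(0).
            p = min(max(p, eps), 1.0 - eps)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    # With no labeled pixels there is nothing to supervise.
    return total / count if count else 0.0


# Toy 2x2 frame: one foreground scribble pixel, one background, two unlabeled.
pred = [[0.9, 0.5],
        [0.2, 0.5]]
scribble = [[1, -1],
            [0, -1]]
loss = partial_bce(pred, scribble)
```

Because only labeled pixels enter the average, changing the prediction at an unlabeled pixel leaves this loss unchanged, so most of each frame receives no direct supervision from a term of this kind.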