In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using features extracted at different abstraction levels. We extend the base hierarchical learning mechanism with two techniques, one for domain adaptation and one for domain-specific learning. For the former, we encourage the model to learn general hierarchical features in an unsupervised manner, using gradient reversal at multiple scales, to improve generalization to datasets for which no annotations are provided during training. For domain specialization, we employ domain-specific operations (namely priors, smoothing, and batch normalization) that specialize the learned features on individual datasets in order to maximize performance. Our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction. When the base hierarchical model is equipped with domain-specific modules, performance improves: it outperforms state-of-the-art models on three out of five metrics on the DHF1K benchmark and achieves the second-best results on the other two. When we instead test it in an unsupervised domain adaptation setting, by enabling the hierarchical gradient reversal layers, we obtain performance comparable to the supervised state of the art.
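To make the gradient reversal idea concrete, the following is a minimal, framework-free sketch rather than the implementation used in this work. A gradient reversal layer acts as the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass, so that features feeding a domain classifier are pushed toward being domain-invariant. The class name `GradientReversal`, the helper `reverse_at_all_scales`, and the strength parameter `lam` are hypothetical names chosen for illustration; applying one such layer per feature scale is an assumption about how "gradient reversal at multiple scales" could be arranged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not a value from the paper

    def forward(self, features: List[float]) -> List[float]:
        # Features pass through unchanged on the way to the domain classifier.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The domain-classifier gradient is negated (and scaled) before it
        # reaches the feature extractor.
        return [-self.lam * g for g in grad_output]


def reverse_at_all_scales(
    domain_grads_per_scale: List[List[float]], lam: float = 1.0
) -> List[List[float]]:
    """Apply one reversal layer per scale (an assumed multi-scale arrangement)."""
    layers = [GradientReversal(lam) for _ in domain_grads_per_scale]
    return [
        layer.backward(grads)
        for layer, grads in zip(layers, domain_grads_per_scale)
    ]
```

In an automatic differentiation framework, the same behavior would normally be written as a custom operation with a user-defined backward pass; the plain-Python version above only shows the sign flip that such an operation performs.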