Video saliency prediction has recently attracted the attention of the research community, as it is an upstream task for several practical applications. However, current solutions are computationally demanding, largely due to the wide use of spatio-temporal 3D convolutions. We observe that, while different model architectures achieve similar performance on benchmarks, visual variations between their predicted saliency maps remain significant. Motivated by this observation, we propose a lightweight model that employs multiple simple heterogeneous decoders and adopts several practical techniques to improve accuracy while keeping computational costs low: hierarchical multi-map knowledge distillation, multi-output saliency prediction, unlabeled auxiliary datasets, and channel reduction with teacher assistant supervision. Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods on the DHF1K, UCF-Sports and Hollywood2 benchmarks, while significantly improving the efficiency of the model. Code is available at https://github.com/feiyanhu/tinyHD
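To make the multi-decoder distillation setup concrete, the following is a minimal PyTorch-style sketch of a student with several simple heterogeneous decoders, each supervised by a teacher saliency map at the matching level of the hierarchy. All names (`MultiDecoderStudent`, `kl_saliency_loss`, `distill_step`) and architectural details are illustrative assumptions for exposition, not the released tinyHD code; see the repository above for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_saliency_loss(pred, target, eps=1e-8):
    """KL divergence between two saliency maps, each normalized
    to a spatial probability distribution (a common saliency loss)."""
    b = pred.size(0)
    p = pred.view(b, -1)
    q = target.view(b, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    return (q * torch.log((q + eps) / (p + eps))).sum(dim=1).mean()

class MultiDecoderStudent(nn.Module):
    """Shared lightweight encoder followed by multiple simple,
    heterogeneous decoders, each emitting its own saliency map.
    The layer sizes here are placeholders, not the paper's."""
    def __init__(self, in_ch=3, feat_ch=32, num_decoders=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Heterogeneous decoders: vary the kernel size per head so the
        # heads produce structurally different maps.
        self.decoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, k, padding=k // 2), nn.ReLU(),
                nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid(),
            )
            for k in (1, 3, 5)[:num_decoders]
        ])

    def forward(self, x):
        f = self.encoder(x)
        # Upsample each head's map back to the input resolution.
        return [F.interpolate(d(f), size=x.shape[-2:], mode="bilinear",
                              align_corners=False) for d in self.decoders]

def distill_step(student, teacher_maps, frames):
    """Hierarchical multi-map distillation: match each student map
    to the teacher map at the corresponding level."""
    student_maps = student(frames)
    loss = sum(kl_saliency_loss(s, t)
               for s, t in zip(student_maps, teacher_maps)) / len(student_maps)
    return loss, student_maps

# Usage with stand-in teacher outputs (random tensors as placeholders):
student = MultiDecoderStudent()
frames = torch.rand(2, 3, 64, 64)
teacher_maps = [torch.rand(2, 1, 64, 64) for _ in range(3)]
loss, maps = distill_step(student, teacher_maps, frames)
loss.backward()
```

At inference time, the several decoder outputs can be combined (for example, averaged) into a single prediction, which is one way to realize the multi-output saliency prediction mentioned above.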