In this paper, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection that combines scale, space, and time information for video saliency modeling. The encoder extracts multi-scale temporal-spatial features from consecutive input video frames and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder hierarchically decodes the temporal-spatial features at different scales and finally produces a single saliency map by integrating multiple video frames. Our model is simple yet effective and can run in real time. We perform extensive experiments, and the results indicate that the well-designed structure significantly improves the precision of video saliency detection. Experimental results on three purely visual video saliency benchmarks and six audio-video saliency benchmarks demonstrate that our method achieves state-of-the-art performance.
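To make the data flow described above concrete, the following is a minimal shape-tracing sketch in plain Python. It is not the authors' implementation: the four-stage layout, channel counts, kernel sizes, strides, the pyramid width of 64, and the helper names (`conv3d_shape`, `encoder_shapes`, `top_down_merge`, `decoder_output_shape`) are all assumptions chosen for illustration. It only tracks tensor shapes, using the standard 3D convolution output-size formula without dilation, to show how a clip is reduced to multi-scale features, aligned in a top-down pass, and decoded into one saliency map.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int, int]  # (channels, frames T, height H, width W)


def conv3d_shape(shape: Shape, out_channels: int,
                 kernel: Tuple[int, int, int],
                 stride: Tuple[int, int, int],
                 padding: Tuple[int, int, int]) -> Shape:
    """Output shape of a 3D convolution (floor formula, no dilation)."""
    _, t, h, w = shape
    out = [(size + 2 * p - k) // s + 1
           for size, k, s, p in zip((t, h, w), kernel, stride, padding)]
    return (out_channels, out[0], out[1], out[2])


def encoder_shapes(clip: Shape) -> List[Shape]:
    """Hypothetical 4-stage encoder producing multi-scale features."""
    # (out_channels, temporal stride): illustrative values, not from the paper
    stages = [(64, 1), (128, 2), (256, 2), (512, 2)]
    shapes, x = [], clip
    for channels, t_stride in stages:
        x = conv3d_shape(x, channels, kernel=(3, 3, 3),
                         stride=(t_stride, 2, 2), padding=(1, 1, 1))
        shapes.append(x)
    return shapes


def top_down_merge(shapes: List[Shape], width: int = 64):
    """Top-down integration, tracked at the shape level.

    Each level is assumed to be projected to `width` channels, and each
    coarser level is assumed to be upsampled to the next finer (T, H, W)
    grid before an element-wise sum. Returns the merged shapes and the
    per-axis upsampling factors that this alignment requires.
    """
    laterals = [(width, t, h, w) for _, t, h, w in shapes]
    merged = [laterals[-1]]  # the coarsest level starts the top-down pass
    factors = []
    for finer in reversed(laterals[:-1]):
        coarser = merged[0]
        factors.insert(0, tuple(f // c for f, c in zip(finer[1:], coarser[1:])))
        merged.insert(0, finer)
    return merged, factors


def decoder_output_shape(clip: Shape) -> Tuple[int, int, int]:
    """Assumed decoder result: time collapsed, one channel, input resolution."""
    _, _, h, w = clip
    return (1, h, w)


clip = (3, 16, 224, 384)  # hypothetical RGB clip: 16 frames of 224x384
features = encoder_shapes(clip)
pyramid, factors = top_down_merge(features)
saliency = decoder_output_shape(clip)
```

With these assumed settings, each pyramid level differs from the next coarser one by a factor of two along time, height, and width, which is what allows a simple upsample-and-add merge in the top-down pass. A real model would replace the shape bookkeeping with actual 3D convolution and interpolation layers.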