Recently, video streams have come to occupy a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency in multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction works consider only visual information and ignore audio, which is inconsistent with naturalistic viewing scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio, and face. The visual branch takes RGB frames as input and encodes them into visual feature maps. The audio and face branches encode the audio signal and multiple cropped faces, respectively. A fusion module is introduced to integrate the information from the three modalities and to generate the final saliency map. Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works and performs closer to human multi-modal attention.
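To make the three-branch design concrete, below is a minimal sketch of such a visual-audio-face saliency model in PyTorch-style Python. The backbones, feature dimensions, pooling choices, and the concatenation-based fusion here are illustrative assumptions, not the exact architecture of the paper; they only show how per-modality features can be encoded and merged into a single saliency map.

```python
# Minimal sketch of a three-branch (visual, audio, face) saliency model with a
# fusion module. All layer choices and feature sizes are placeholder assumptions.
import torch
import torch.nn as nn


class ThreeBranchSaliencyNet(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Visual branch: encodes a clip of RGB frames, shape (B, 3, T, H, W).
        self.visual = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Audio branch: encodes an audio spectrogram, shape (B, 1, F, Ta).
        self.audio = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Face branch: encodes cropped faces, shape (B, N, 3, h, w), per face.
        self.face = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion module: concatenate the three modalities, predict one saliency map.
        self.fusion = nn.Sequential(
            nn.Conv2d(3 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, frames, audio, faces):
        # frames: (B, 3, T, H, W); audio: (B, 1, F, Ta); faces: (B, N, 3, h, w)
        v = self.visual(frames).mean(dim=2)        # pool over time -> (B, C, H, W)
        a = self.audio(audio)                      # global audio feature -> (B, C, 1, 1)
        b, n, c, h, w = faces.shape
        f = self.face(faces.view(b * n, c, h, w))  # per-face features
        f = f.view(b, n, -1, 1, 1).mean(dim=1)     # pool over faces -> (B, C, 1, 1)
        # Broadcast audio/face features over the spatial grid and fuse.
        a = a.expand(-1, -1, v.shape[2], v.shape[3])
        f = f.expand(-1, -1, v.shape[2], v.shape[3])
        sal = self.fusion(torch.cat([v, a, f], dim=1))  # (B, 1, H, W)
        return torch.sigmoid(sal)
```

In this sketch the audio and face features are simply broadcast over the visual feature grid before a convolutional fusion; the actual model may use a more elaborate fusion strategy, but the data flow (three encoders feeding one fusion module that outputs the saliency map) follows the description above.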