Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction

The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism for perceiving dynamic scenes, which is important in many vision tasks. Most existing methods consider only visual cues and neglect the accompanying audio information, which can provide complementary information for scene understanding. In fact, auditory and visual cues are strongly related, and humans generally perceive the surrounding scene by sensing these cues collaboratively. Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task, which uses the audio modality to assist the vision modality in better predicting the dynamic saliency map. The proposed method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) collaborative integration. Firstly, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features that contain both spatial location and temporal motion information. Secondly, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between audio and visual information. Thirdly, a collaborative integration part is devised to adaptively aggregate the audio-visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show significant superiority over state-of-the-art DSP models.
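
The abstract does not specify how the location and integration parts are computed, so the following is only a minimal NumPy sketch of the general idea under stated assumptions. It assumes that sound-source location can be approximated by the cosine similarity between one audio feature vector and every spatial position of a visual feature map, that the center-bias prior is an isotropic Gaussian, and that the aggregation uses fixed weights in place of the adaptive (learned) aggregation the abstract describes. The function names `localize_sound`, `center_bias`, and `fuse`, and all parameter values, are hypothetical and are not taken from the paper.

```python
import numpy as np

def localize_sound(visual_feat, audio_feat, eps=1e-8):
    """Assumed stand-in for the audio-visual location part.

    visual_feat: (C, H, W) visual feature map; audio_feat: (C,) audio vector.
    Returns an (H, W) map of cosine similarities rescaled to [0, 1].
    """
    c, h, w = visual_feat.shape
    v = visual_feat.reshape(c, h * w)
    v = v / (np.linalg.norm(v, axis=0, keepdims=True) + eps)  # unit-norm per position
    a = audio_feat / (np.linalg.norm(audio_feat) + eps)       # unit-norm audio vector
    sim = a @ v                                               # (H*W,), in [-1, 1]
    return ((sim + 1.0) / 2.0).reshape(h, w)

def center_bias(h, w, sigma=0.25):
    """Assumed isotropic Gaussian prior that peaks at the frame center."""
    ys = np.linspace(-0.5, 0.5, h)[:, None]
    xs = np.linspace(-0.5, 0.5, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def fuse(visual_sal, av_map, prior, weights=(0.5, 0.3, 0.2)):
    """Fixed-weight aggregation (an assumption; the paper aggregates adaptively)."""
    wts = np.asarray(weights, dtype=float)
    wts = wts / wts.sum()
    fused = wts[0] * visual_sal + wts[1] * av_map + wts[2] * prior
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)                       # rescale to [0, 1]

# Usage with random stand-in features (no real encoder is involved here).
rng = np.random.default_rng(0)
visual_feat = rng.normal(size=(8, 4, 6))
audio_feat = rng.normal(size=8)
visual_sal = rng.random((4, 6))
saliency = fuse(visual_sal, localize_sound(visual_feat, audio_feat), center_bias(4, 6))
```

In a trained model the weights would be produced by the network rather than fixed by hand, and the features would come from the audio and visual encoders instead of a random generator.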
