Both visual and auditory information are valuable for determining the salient regions in videos. Deep convolutional neural networks (CNNs) have demonstrated strong capability on the audio-visual saliency prediction task. However, due to various factors such as shooting scenes and weather, there often exists a moderate distribution discrepancy between the source training data and the target testing data, and this domain discrepancy causes the performance of CNN models to degrade on the target testing data. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a dedicated domain discrimination branch is built to align the auditory feature distributions. Then, the auditory features are fused into the visual features through a cross-modal self-attention module. A second domain discrimination branch is devised to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method relieves the performance degradation caused by the domain discrepancy.
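To make the dual-branch design concrete, below is a minimal PyTorch sketch of one plausible realization. It assumes the standard gradient-reversal formulation of domain-adversarial training and uses illustrative module names (GradReverse, DomainDiscriminator, CrossModalSelfAttention), feature dimensions, and token shapes that are not specified in the text; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Gradient reversal: identity in the forward pass, negated (scaled)
    # gradient in the backward pass, so the feature extractor learns to
    # confuse the domain discriminator. This is an assumed mechanism,
    # standard in domain-adversarial training.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class DomainDiscriminator(nn.Module):
    # Binary classifier predicting source vs. target domain for a feature vector.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, feat, lambd=1.0):
        return self.net(grad_reverse(feat, lambd))

class CrossModalSelfAttention(nn.Module):
    # Fuses auditory features into visual features: visual tokens attend
    # to audio tokens; a residual connection preserves visual content.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)

# Hypothetical training step with illustrative shapes.
D = 256
audio_disc = DomainDiscriminator(D)   # branch 1: aligns auditory features
fused_disc = DomainDiscriminator(D)   # branch 2: aligns fused audio-visual features
fusion = CrossModalSelfAttention(D)
bce = nn.BCEWithLogitsLoss()

audio_feat = torch.randn(2, 16, D)    # (batch, audio tokens, dim)
visual_feat = torch.randn(2, 49, D)   # (batch, visual tokens, dim)
domain_label = torch.zeros(2, 1)      # 0 = source, 1 = target

fused = fusion(visual_feat, audio_feat)
loss_audio = bce(audio_disc(audio_feat.mean(dim=1)), domain_label)
loss_fused = bce(fused_disc(fused.mean(dim=1)), domain_label)
loss = loss_audio + loss_fused        # added to the saliency prediction loss
```

In this sketch each discriminator pushes its input features toward domain invariance via the reversed gradient, so the first branch aligns purely auditory statistics while the second aligns the visual features and audio-visual correlations carried by the fused representation.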