In this paper, we propose a systematic study of machine multisensory perception under adversarial attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; that audio-visual integration can decrease rather than strengthen model robustness under multimodal attacks; that even a weakly-supervised sound source visual localization model can be successfully fooled; and that our defense method can improve the robustness of audio-visual networks without significantly sacrificing clean model performance.
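To make the multimodal attack setting concrete, below is a minimal sketch of a single-step FGSM-style attack on a toy late-fusion audio-visual classifier. This is an illustrative assumption, not the paper's actual attack or model: the linear branches `W_a`/`W_v`, the `fgsm` helper, and the choice of cross-entropy are all hypothetical, chosen only to show how the audio, visual, or both inputs can be perturbed.

```python
import numpy as np

# Hypothetical toy setup (not the paper's model): a linear late-fusion
# audio-visual classifier with 8-dim features per modality and 2 classes.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 2))  # audio branch weights (assumed)
W_v = rng.normal(size=(8, 2))  # visual branch weights (assumed)

def logits(x_a, x_v):
    # Late fusion: sum of per-modality logits.
    return x_a @ W_a + x_v @ W_v

def xent(x_a, x_v, y):
    # Cross-entropy loss of the true class y (numerically stabilized).
    z = logits(x_a, x_v)
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def loss_grad(x_a, x_v, y):
    # Gradient of the cross-entropy w.r.t. each modality's input.
    z = logits(x_a, x_v)
    p = np.exp(z - z.max())
    p /= p.sum()
    dz = p.copy()
    dz[y] -= 1.0
    return W_a @ dz, W_v @ dz

def fgsm(x_a, x_v, y, eps, attack=("audio", "visual")):
    # One-step FGSM: perturb only the selected modalities in the
    # sign of the loss gradient, with budget eps per modality.
    g_a, g_v = loss_grad(x_a, x_v, y)
    if "audio" in attack:
        x_a = x_a + eps * np.sign(g_a)
    if "visual" in attack:
        x_v = x_v + eps * np.sign(g_v)
    return x_a, x_v
```

Passing `attack=("audio",)`, `attack=("visual",)`, or both selects which modality is perturbed, mirroring the three attack settings the abstract describes; for this convex toy model the loss is guaranteed not to decrease after the step.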