Audio-visual active speaker detection (AVASD) is well developed and has become an indispensable front-end for several multi-modal applications. However, to the best of our knowledge, the adversarial robustness of AVASD models has not been investigated, let alone effective defenses against such attacks. In this paper, we are the first to reveal the vulnerability of AVASD models to audio-only, visual-only, and audio-visual adversarial attacks through extensive experiments. Moreover, we propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples under an allocated attack budget. The loss pushes the inter-class embeddings apart, so that the non-speech and speech clusters are sufficiently disentangled, and pulls the intra-class embeddings together to keep each cluster compact. Experimental results show that AVIL outperforms adversarial training by 33.14% mAP under multi-modal attacks.
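The intra-class pull and inter-class push described above can be illustrated with a minimal sketch. This is not the paper's exact AVIL formulation; the function names, the use of class centroids, and the hinge margin are all illustrative assumptions:

```python
# Hedged sketch of an inter/intra-class embedding loss in the spirit of AVIL:
# pull same-class embeddings toward their centroid (compactness) and push the
# speech / non-speech centroids apart (dispersion). Pure-Python for clarity;
# the margin value and centroid-based form are assumptions, not the paper's loss.

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dispersion_compactness_loss(speech, non_speech, margin=1.0):
    """Intra-class pull plus a hinged inter-class push between the two clusters."""
    c_s, c_n = centroid(speech), centroid(non_speech)
    # Compactness: mean squared distance of each embedding to its own centroid.
    intra = (sum(sq_dist(v, c_s) for v in speech) / len(speech)
             + sum(sq_dist(v, c_n) for v in non_speech) / len(non_speech))
    # Dispersion: penalize centroids only when they are closer than the margin.
    inter = max(0.0, margin - sq_dist(c_s, c_n))
    return intra + inter
```

Well-separated, tight clusters drive both terms to zero, so an attacker needs a larger perturbation to move an embedding across the decision boundary within a fixed budget.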