Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
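To make the attack idea concrete, the following is a minimal, hedged sketch of how such adversarial noise could be crafted against a differentiable predictive enhancement model. All names here (`enhancer`, `masking_threshold`, `target`, the loss weights) are illustrative assumptions, not the paper's actual method: a gradient-based optimization drives the enhanced output toward a semantically different target while penalizing perturbation energy above a psychoacoustic masking threshold estimated from the original input.

```python
import torch

# Hypothetical placeholders (assumptions, not the paper's implementation):
#   enhancer          - any differentiable predictive speech enhancement model
#   x                 - the original input waveform (1D tensor)
#   target            - a waveform carrying different semantic content
#   masking_threshold - per-bin psychoacoustic masking threshold derived from x

def craft_adversarial_noise(enhancer, x, target, masking_threshold,
                            steps=1000, lr=1e-3, lam=1.0):
    """Targeted attack sketch: find an additive perturbation delta such that
    enhancer(x + delta) approximates `target`, while keeping the perturbation's
    spectrum below the masking threshold so it stays inaudible under x."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    window = torch.hann_window(512, device=x.device)

    for _ in range(steps):
        opt.zero_grad()
        enhanced = enhancer(x + delta)

        # Drive the enhanced output toward the adversarial target content.
        attack_loss = torch.nn.functional.mse_loss(enhanced, target)

        # Penalize perturbation energy rising above the masking threshold,
        # i.e. the part of the noise that would become audible.
        spec = torch.stft(delta, n_fft=512, window=window,
                          return_complex=True).abs()
        audibility = torch.relu(spec - masking_threshold).mean()

        loss = attack_loss + lam * audibility
        loss.backward()
        opt.step()

    return delta.detach()
```

Under this reading, a deterministic predictive model offers a fixed, differentiable input-output mapping for the optimizer to exploit, whereas a diffusion model with a stochastic sampler changes its mapping on every run, which is consistent with the robustness the abstract attributes to such models.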