In authentication scenarios, practical speaker verification systems usually require a person to read a dynamic authentication text aloud. Previous studies performed physical attacks by playing an audio adversarial example as a digital signal, which is easily rejected by audio replay detection modules. This work shows that when our crafted adversarial perturbation is played as a separate source while the adversary is speaking, a practical speaker verification system misjudges the adversary as the target speaker. We propose a two-step algorithm that optimizes a universal adversarial perturbation to be text-independent and to have little effect on recognition of the authentication text. The algorithm also estimates the room impulse response (RIR), which allows the perturbation to remain effective after being played over the air. In the physical experiment, we achieved targeted attacks with a success rate of 100%, while the word error rate (WER) of speech recognition increased by only 3.55%. The recorded audio could also pass replay detection, since it contains a live person speaking.
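To make the over-the-air setting concrete, the following is a minimal sketch (not the authors' implementation) of how a perturbation played from a loudspeaker might be modelled at the microphone: the perturbation is convolved with an RIR and added to the live speech. The function name `simulate_over_the_air`, the random stand-in signals, and the toy exponentially decaying RIR are all assumptions made for illustration; in practice the RIR would be estimated or measured, as the abstract describes.

```python
import numpy as np

def simulate_over_the_air(speech, perturbation, rir):
    """Hypothetical helper: mix live speech with a perturbation heard through a room."""
    # Convolving with the RIR approximates how the loudspeaker signal
    # arrives at the microphone after room reflections.
    played = np.convolve(perturbation, rir, mode="full")[: len(speech)]
    # Zero-pad if the convolved perturbation is shorter than the speech.
    if len(played) < len(speech):
        played = np.pad(played, (0, len(speech) - len(played)))
    # The microphone records the sum of the two separate sources.
    return speech + played

rng = np.random.default_rng(0)
sr = 16000                                       # assumed sample rate (Hz)
speech = 0.1 * rng.standard_normal(sr)           # stand-in for 1 s of adversary speech
perturbation = 0.01 * rng.standard_normal(sr)    # stand-in for the crafted perturbation
# Toy RIR (assumption): direct path followed by an exponentially decaying tail.
rir = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)
rir[0] = 1.0

mixed = simulate_over_the_air(speech, perturbation, rir)
```

In an optimization loop, a simulation of this kind would sit between the perturbation and the verification model, so that the perturbation is trained on what the microphone is expected to hear rather than on the raw digital signal.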