To our knowledge, existing adversarial attack approaches to speaker identification either incur a high computational cost or are not very effective. To address this issue, we propose a novel generation-network-based approach, called the symmetric saliency-based encoder-decoder (SSED), to generate adversarial voice examples against speaker identification. It contains two novel components. First, it uses a saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system, so that the attacker focuses on adding artificial noise to the important samples. Second, it uses an angular loss function to push the speaker embedding far away from that of the source speaker. Our experimental results demonstrate that the proposed SSED yields state-of-the-art performance, i.e., a targeted attack success rate of over 97% and a signal-to-noise ratio of over 39 dB on both the open-set and closed-set speaker identification tasks, at a low computational cost.
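To make the two quantities mentioned above concrete, the following is a minimal sketch, not the paper's implementation. It assumes the signal-to-noise figure is the usual ratio of clean-signal power to perturbation power in dB, and it shows only the angle between two embedding vectors, which is the kind of quantity an angular loss would act on; the exact form of the SSED loss is not given here. The function names `snr_db` and `angle_between` are hypothetical.

```python
import math
from typing import Sequence


def snr_db(clean: Sequence[float], adversarial: Sequence[float]) -> float:
    """SNR in dB of the clean signal relative to the added perturbation.

    Assumption: the perturbation is the sample-wise difference
    between the adversarial and the clean waveform.
    """
    if len(clean) != len(adversarial):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in clean)
    noise_power = sum((a - x) ** 2 for x, a in zip(clean, adversarial))
    if noise_power == 0.0:
        return math.inf  # no perturbation at all
    if signal_power == 0.0:
        return -math.inf  # silent clean signal, non-zero perturbation
    return 10.0 * math.log10(signal_power / noise_power)


def angle_between(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two speaker embeddings (illustrative only)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

Under these assumptions, a perturbation whose amplitude is 1% of the signal's corresponds to an SNR of about 40 dB, which is the range reported above. A larger angle between the adversarial embedding and the source speaker's embedding means the two are further apart in the angular sense.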