Despite remarkable improvements, automatic speech recognition (ASR) is susceptible to adversarial perturbations. Compared to attacks on standard machine learning architectures, attacks on ASR systems are significantly more challenging, especially since the input to a speech recognition system is a time series that contains both acoustic and linguistic properties of speech. Extracting all recognition-relevant information requires more complex pipelines and an ensemble of specialized components. Consequently, an attacker needs to consider the entire pipeline. In this paper, we present VENOMAVE, the first training-time poisoning attack against speech recognition. We pursue the same goal as the predominantly studied evasion attacks: leading the system to an incorrect, attacker-chosen transcription of a target audio waveform. In contrast to evasion attacks, however, we assume that the attacker can manipulate only a small part of the training data and cannot alter the target audio waveform at runtime. We evaluate our attack on two datasets: TIDIGITS and Speech Commands. When poisoning less than 0.17% of the dataset, VENOMAVE achieves attack success rates above 80.0% without access to the victim's network architecture or hyperparameters. In a more realistic scenario, when the target audio waveform is played over the air in different rooms, VENOMAVE maintains a success rate of up to 73.3%. Finally, VENOMAVE achieves an attack transferability rate of 36.4% between two different model architectures.