ASR (automatic speech recognition) systems such as Siri, Alexa, Google Voice, and Cortana have become quite popular recently. One of the key techniques enabling the practical use of such systems in people's daily lives is deep learning. Although deep learning in computer vision is known to be vulnerable to adversarial perturbations, little is known about whether such perturbations remain effective against practical speech recognition. In this paper, we not only demonstrate that such attacks can happen in reality, but also show that they can be conducted systematically. To avoid attracting users' attention, we choose to embed the voice commands into a song, called a CommandSong. In this way, the song carrying the command can spread through radio, TV, or any media player installed on portable devices such as smartphones, potentially impacting millions of users over long distances. In particular, we overcome two major challenges: minimizing the revision of the song when embedding the commands, and letting the CommandSong spread through the air without losing the voice "command". Our evaluation demonstrates that we can craft random songs to "carry" any command, and that the modification is extremely difficult to notice. Notably, the physical attack, in which we play the CommandSongs over the air and record them, succeeds with a 94% success rate.
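The first challenge above, keeping the revision of the song small enough to go unnoticed, is often quantified by how quiet the added perturbation is relative to the original audio. The following toy sketch (with a synthetic tone standing in for a real song, and random noise standing in for an actual adversarial perturbation; neither reflects the paper's method) illustrates measuring that with a signal-to-noise ratio:

```python
import numpy as np

def snr_db(signal, perturbation):
    """Signal-to-noise ratio in dB: higher means the perturbation is
    quieter relative to the song, hence harder for a listener to notice."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(perturbation ** 2)
    return 10 * np.log10(p_signal / p_noise)

# Toy "song": one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
song = 0.5 * np.sin(2 * np.pi * 440 * t)

# Stand-in perturbation, kept at low amplitude so the modified song
# sounds almost identical to the original.
rng = np.random.default_rng(0)
delta = 0.005 * rng.standard_normal(sr)
command_song = song + delta

print(round(snr_db(song, delta), 1))  # well above 30 dB => subtle change
```

A real attack would shape `delta` to drive the ASR model toward the target command transcription while keeping this ratio high; the sketch only shows the perceptibility side of that trade-off.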