We construct audio adversarial examples on automatic Speech-To-Text (STT) systems. Given any audio waveform, we produce another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere, and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool state-of-the-art STT systems, yet humans can consistently make out the speech. The feasibility of this attack introduces a new domain for studying machine and human perception of speech.
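A minimal sketch of the overlay step described above, assuming the original waveform is a normalized 1-D float array; `generate_vocal_mask` is a hypothetical placeholder (the abstract does not specify how the mask is derived from the original audio), and `overlay_mask` simply adds the mask to the waveform and clips to a playable range.

```python
import numpy as np

def generate_vocal_mask(waveform: np.ndarray, strength: float = 0.1) -> np.ndarray:
    # Hypothetical mask derived from the original audio: a scaled copy with the
    # sign of the upper half of its spectrum flipped. Illustration only; this is
    # not the paper's actual mask construction.
    spectrum = np.fft.rfft(waveform)
    spectrum[len(spectrum) // 2:] *= -1.0
    return strength * np.fft.irfft(spectrum, n=len(waveform))

def overlay_mask(waveform: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Adversarial example = original waveform + vocal mask, clipped so the
    # result remains valid audio in [-1, 1].
    return np.clip(waveform + mask, -1.0, 1.0)

if __name__ == "__main__":
    # Stand-in audio: one second of a 440 Hz tone at a 16 kHz sample rate.
    original = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000, endpoint=False))
    adversarial = overlay_mask(original, generate_vocal_mask(original))
    print(adversarial.shape)
```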