Automatic speech recognition (ASR) services inherit the vulnerabilities of deep neural networks, such as crafted adversarial examples. Existing attack methods often suffer from low efficiency because the target phrases are added to the entire audio sample, which demands substantial computational resources. This paper proposes FAAG, an iterative optimization-based method that generates targeted adversarial examples quickly. By injecting the noise only into the beginning part of the audio, FAAG produces high-quality adversarial audio with a high success rate in a short time. Specifically, we use the logits output for the audio to map each character in the transcription to an approximate frame position in the audio. As a result, FAAG can generate an adversarial example in approximately two minutes using CPUs only, or in around ten seconds with one GPU, while maintaining an average success rate over 85%. Compared with the baseline method, FAAG speeds up adversarial example generation by around 60%. Furthermore, we found that appending benign audio to any suspicious example can effectively defend against the targeted adversarial attack. We hope that this work paves the way for new adversarial attacks against speech recognition under computational constraints.
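The abstract does not spell out how logits are mapped to frame positions, so the following is only a minimal sketch of one plausible reading. It assumes CTC-style per-frame logits with a blank class and a greedy (argmax) alignment; the function names `char_frame_positions` and `prefix_end_frame`, the toy alphabet, and the blank index are all hypothetical and are not taken from the paper.

```python
import numpy as np


def char_frame_positions(logits, alphabet, blank):
    """Greedy CTC-style alignment (an assumption, not the paper's stated method).

    logits: array of shape (num_frames, num_classes).
    Returns a list of (character, first_frame_index) pairs.
    """
    best = np.asarray(logits).argmax(axis=1)  # most likely class per frame
    positions = []
    prev = blank
    for t, k in enumerate(best):
        k = int(k)
        # A character starts where a non-blank class first appears
        # after a blank or after a different class.
        if k != blank and k != prev:
            positions.append((alphabet[k], t))
        prev = k
    return positions


def prefix_end_frame(positions, n_chars, total_frames):
    """Approximate frame at which the first n_chars characters end.

    Taken as the start frame of the next character, or the end of the
    clip when no further character exists.
    """
    if n_chars < len(positions):
        return positions[n_chars][1]
    return total_frames


# Toy example: classes 0..2 are "a", "b", "c"; class 3 is the blank.
alphabet = "abc"
frame_path = [3, 0, 0, 3, 1, 1, 3, 2]
toy_logits = np.eye(4)[frame_path]  # one-hot rows stand in for real logits

positions = char_frame_positions(toy_logits, alphabet, blank=3)
end_frame = prefix_end_frame(positions, 2, len(frame_path))
```

In this toy input, the first two characters occupy the frames before `end_frame`, which suggests how a perturbation could be confined to the beginning of the clip rather than spread over the whole sample. Converting a frame index to an audio sample offset depends on the feature hop length of the particular ASR front end, which the abstract does not specify.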