Automatic speech recognition (ASR) systems are prevalent, particularly in applications for voice navigation and voice control of domestic appliances. The computational core of ASRs are deep neural networks (DNNs), which have been shown to be susceptible to adversarial perturbations that attackers can exploit to produce malicious outputs. To help test the correctness of ASRs, we propose techniques that automatically generate blackbox (agnostic to the DNN), untargeted adversarial attacks that are portable across ASRs. Much of the existing work on adversarial ASR testing focuses on targeted attacks, i.e., generating audio samples that force a given output text. Targeted techniques are not portable, as they are customised to the structure of the DNNs (whitebox) within a specific ASR. In contrast, our method attacks the signal processing stage of the ASR pipeline, which is shared across most ASRs. Additionally, we ensure the generated adversarial audio samples have no humanly audible difference from the originals by manipulating the acoustic signal using a psychoacoustic model that keeps the perturbation below the thresholds of human perception. We evaluate the portability and effectiveness of our techniques on three popular ASRs and three input audio datasets, using three metrics: word error rate (WER) of the output text, similarity to the original audio, and attack success rate on the different ASRs. We found our testing techniques were portable across ASRs, with the adversarial audio samples producing high success rates, WERs, and similarities to the original audio.
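To make the psychoacoustic constraint concrete, the following is a minimal sketch (not the paper's exact method) of shaping an additive perturbation so that its per-frequency-bin energy stays a fixed margin below the original signal's spectrum, a crude stand-in for a full psychoacoustic masking threshold. The function name, the `margin_db` parameter, and the threshold heuristic are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: frequency-domain noise shaping under a simplified
# masking threshold derived from the clean signal itself.
import numpy as np

def masked_perturbation(signal, margin_db=20.0, seed=0):
    """Return additive noise whose magnitude spectrum sits `margin_db`
    below the original signal's spectrum in every frequency bin."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(signal)
    # Crude stand-in for a psychoacoustic masking threshold: the signal's
    # own magnitude spectrum attenuated by a fixed dB margin.
    threshold = np.abs(spectrum) * 10 ** (-margin_db / 20.0)
    # Random-phase noise, scaled per bin to sit exactly at the threshold.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    noise_spectrum = threshold * np.exp(1j * phase)
    return np.fft.irfft(noise_spectrum, n=len(signal))

# Example: perturb one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
adversarial = clean + masked_perturbation(clean)
```

For the evaluation metrics, WER is the standard edit-distance measure, WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions between the ASR output and the reference transcript, and N is the number of reference words; a successful untargeted attack drives WER up while keeping the audio similarity to the original high.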