Although end-to-end automatic speech recognition (e2e ASR) models are widely deployed in many applications, few studies have examined their robustness against adversarial perturbations. In this paper, we explore whether a targeted universal perturbation vector exists for e2e ASR models. Our goal is to find perturbations that can mislead the models into predicting a given target transcript, such as "thank you" or the empty string, on any input utterance. We study two different attacks, namely additive and prepending perturbations, and their performance on the state-of-the-art LAS, CTC, and RNN-T models. We find that LAS is the most vulnerable to perturbations among the three models. RNN-T is more robust against additive perturbations, especially on long utterances, and CTC is robust against both additive and prepending perturbations. To attack RNN-T, we find that the prepending perturbation is more effective than the additive perturbation and can mislead the model into predicting the same short target on utterances of arbitrary length.
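To make the two attack types concrete, the sketch below contrasts how an additive and a prepending universal perturbation could be applied to a raw waveform. All names and shapes are hypothetical, and tiling the additive perturbation to cover an utterance of arbitrary length is an assumption for illustration, not necessarily the scheme used in the paper.

```python
import numpy as np

# Hypothetical sketch of applying a universal perturbation `delta` to a waveform `x`.
# Additive attack: delta is added on top of the audio (tiled/cropped to match length).
# Prepending attack: delta is concatenated in front of the audio, leaving it untouched.

def apply_additive(x: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Add a universal perturbation to an utterance of arbitrary length (tiling is an assumption)."""
    reps = int(np.ceil(len(x) / len(delta)))
    tiled = np.tile(delta, reps)[: len(x)]  # repeat the perturbation, then crop to fit x
    return x + tiled

def apply_prepending(x: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Prepend a universal perturbation; its length is independent of the utterance length."""
    return np.concatenate([delta, x])

# Example: a 1-second perturbation applied to a 3-second utterance at 16 kHz.
x = np.random.randn(48000).astype(np.float32)
delta = 0.01 * np.random.randn(16000).astype(np.float32)
adv_add = apply_additive(x, delta)    # same length as x
adv_pre = apply_prepending(x, delta)  # len(delta) + len(x)
```

The prepending form leaves the original utterance intact, which is consistent with the abstract's observation that a fixed-length prepended perturbation can target utterances of arbitrary length.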