In this work, we aim to enhance the robustness of end-to-end automatic speech recognition (ASR) systems against adversarially noisy speech examples. We focus on a rigorous, empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). In this setting, adversarial noise is generated only by closed-model optimization (e.g., evolutionary search and zeroth-order estimation), without direct access to the gradient information of the targeted ASR model. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which models latent distributions against adaptive adversarial perturbations using a divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems equipped with the proposed adversarial detection system. With the proposed BNN-based detection system, we improve the detection rate by +2.77% to +5.42% (relative: +3.03% to +6.26%) and reduce the word error rate by 5.02% to 7.47% on LibriSpeech datasets, compared to current model enhancement methods against adversarial speech examples.
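To make the closed-model setting concrete, the following is a minimal sketch of zeroth-order gradient estimation, one of the gradient-free optimization families named above. It is not the attack used in this work: the function name `zeroth_order_gradient`, its parameters (`num_samples`, `mu`), and the black-box `loss_fn` are all illustrative assumptions. The estimator queries the loss only through its outputs and never touches model gradients.

```python
import numpy as np


def zeroth_order_gradient(loss_fn, x, num_samples=20, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at x from function queries alone.

    Hypothetical helper for illustration: loss_fn stands in for a
    black-box score returned by a closed ASR model.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_samples):
        # Random Gaussian probing direction with the same shape as the input.
        u = rng.standard_normal(x.shape)
        # Symmetric finite difference of the loss along u (two queries).
        delta = loss_fn(x + mu * u) - loss_fn(x - mu * u)
        grad += (delta / (2.0 * mu)) * u
    # Averaging over directions gives an estimate of the true gradient.
    return grad / num_samples
```

Each probing direction costs two model queries, so the query budget grows linearly with `num_samples`; this cost is one reason closed-model attacks are typically slower than gradient-based ones.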