Automatic speech recognition (ASR) has become increasingly common on modern edge devices. Past work has developed streaming End-to-End (E2E) all-neural speech recognizers that run compactly on edge devices. However, E2E ASR models are prone to overfitting and have difficulty generalizing to unseen test data. Various techniques have been proposed to regularize the training of ASR models, including layer normalization, dropout, spectral data augmentation, and speed distortion of the inputs. In this work, we present a simple yet effective noisy training strategy that further improves E2E ASR model training. By introducing random noise into the parameter space during training, our method produces smoother models at convergence that generalize better. We apply noisy training to improve both dense and sparse state-of-the-art Emformer models and observe consistent WER reductions. Specifically, when training Emformers with 90% sparsity, we achieve 12% and 14% WER improvements on the LibriSpeech Test-other and Test-clean sets, respectively.
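To make the core idea concrete, below is a minimal sketch of one noisy training step in PyTorch. It assumes Gaussian weight noise injected before the forward/backward pass and removed before the optimizer update, so gradients are computed at a perturbed point in parameter space; the function name, the `noise_std` value, and the per-step injection schedule are illustrative assumptions, not the paper's reported configuration.

```python
import torch

def noisy_training_step(model, inputs, targets, loss_fn, optimizer, noise_std=0.01):
    """One training step with Gaussian weight-noise injection (sketch)."""
    # Perturb every parameter in place, remembering the noise so it can be undone.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * noise_std
            p.add_(noise)
            noises.append(noise)

    # Compute loss and gradients at the perturbed point in parameter space.
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    # Restore the original weights before applying the update, so only the
    # gradient (not the noise itself) moves the model.
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)

    optimizer.step()
    return loss.item()
```

Intuitively, averaging gradients taken at randomly perturbed weights penalizes sharp minima, which is consistent with the paper's claim that noise injection yields smoother models that generalize better.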