Environmental noise and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is commonly used to address this problem, but it requires many-fold data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of acoustic models in hybrid ASR. Our noise vectors are obtained by combining the means of the speech frames and the silence frames in the utterance, where the speech/silence labels may be obtained from a GMM-HMM model trained for ASR alignments, so that no extra computation is required beyond averaging the feature vectors. We show through experiments on AMI and Aurora-4 that this simple adaptation technique yields a 6-7% relative WER improvement. We implement several embedding-based adaptation baselines proposed in the literature and show that our method outperforms them on both datasets. Finally, we extend our method to the online ASR setting by using frame-level maximum-likelihood estimation of the means.
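The core computation described above is lightweight: given per-frame acoustic features and speech/silence labels from a first-pass alignment, the noise vector is built from the two class means. A minimal sketch in NumPy is below; the concatenation of the speech and silence means is one plausible way to "combine" them and is an assumption here, since the abstract does not fix the exact combination.

```python
import numpy as np

def noise_vector(feats: np.ndarray, sil_mask: np.ndarray) -> np.ndarray:
    """Compute an utterance-level noise vector.

    feats:    (T, D) matrix of acoustic features (e.g. MFCCs or fbanks).
    sil_mask: (T,) boolean array, True where the frame is labeled silence
              (labels would come from GMM-HMM alignments in the paper).

    Returns a (2*D,) vector: [mean of speech frames ; mean of silence frames].
    The concatenation is an illustrative choice, not the paper's exact recipe.
    """
    d = feats.shape[1]
    sil = feats[sil_mask]
    speech = feats[~sil_mask]
    # Guard against utterances with no frames of one class.
    mu_speech = speech.mean(axis=0) if len(speech) else np.zeros(d)
    mu_sil = sil.mean(axis=0) if len(sil) else np.zeros(d)
    return np.concatenate([mu_speech, mu_sil])
```

In an online setting, the same means could instead be maintained as running (frame-level maximum-likelihood) estimates, updated as frames arrive, rather than computed once over the full utterance.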