Improving the accuracy of single-channel automatic speech recognition (ASR) in noisy conditions is challenging. Strong speech enhancement front-ends are available; however, they typically require retraining the ASR model to cope with processing artifacts. In this paper we explore a speaker reinforcement strategy for improving recognition performance without retraining the acoustic model (AM). This is achieved by remixing the enhanced signal with the unprocessed input, which alleviates the processing artifacts. We evaluate the proposed approach using a DNN speaker-extraction-based speech denoiser trained with a perceptually motivated loss function. Results show that, without AM retraining, our method yields about 23% and 25% relative accuracy gains over the unprocessed signal on the monaural simulated and real CHiME-4 evaluation sets, respectively, and outperforms a state-of-the-art reference method.
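The remixing step described above can be sketched minimally as a gain-weighted sum of the enhanced speech and the original noisy input; the `gain_db` parameter and the peak normalization are illustrative assumptions, not details from the paper.

```python
import numpy as np

def speaker_reinforcement(noisy, enhanced, gain_db=10.0):
    """Remix the enhanced signal with the unprocessed input.

    The enhanced (denoised) speech is boosted relative to the
    unprocessed signal, so the target speaker is reinforced while the
    retained background masks enhancement artifacts that would
    otherwise degrade an ASR model trained on unprocessed audio.
    gain_db is a hypothetical tuning parameter for illustration.
    """
    gain = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude gain
    mixed = noisy + gain * enhanced
    # Peak-normalize only if the remix would clip.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

In this sketch a higher `gain_db` pushes the mixture toward the fully enhanced signal, while `gain_db` near zero keeps it close to a plain sum with the unprocessed input; the useful operating point would have to be tuned on development data.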