It has been shown that the intelligibility of noisy speech can be improved by speech enhancement algorithms. However, speech enhancement has not been established as an effective front-end for robust automatic speech recognition (ASR) compared with an ASR model trained directly on noisy speech. This divide between speech enhancement and ASR impedes the progress of robust ASR systems, especially as speech enhancement has made major strides in recent years. In this work, we focus on eliminating this divide with a time-domain enhancement model based on an attentive recurrent network (ARN). The proposed system fully decouples speech enhancement from the acoustic model, which is trained only on clean speech. Results on the CHiME-2 corpus show that ARN-enhanced speech translates to improved ASR results. The proposed system achieves a $6.28\%$ average word error rate, outperforming the previous best by $19.3\%$.