This study aims to improve the performance of automatic speech recognition (ASR) under noisy conditions. The use of a speech enhancement (SE) frontend has been widely studied for noise-robust ASR. However, most single-channel SE models introduce processing artifacts into the enhanced speech, which degrade ASR performance. To overcome this problem, we propose Signal-to-Noise Ratio improvement (SNRi) target training, in which the SE frontend automatically controls its noise reduction level to avoid degrading ASR performance through artifacts. The SE frontend takes an auxiliary scalar input that represents the target SNRi of the output signal. The target SNRi value is estimated by an SNRi prediction network, which is trained to minimize the ASR loss. Experiments using 55,027 hours of noisy speech training data show that SNRi target training enables control of the SNRi of the output signal, and that joint training reduces word error rate by 12% compared to a state-of-the-art Conformer-based ASR model.
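As a rough illustration of the quantity being controlled, the sketch below computes SNRi as the output SNR minus the input SNR, both measured in dB against a clean reference. This is the conventional definition and is assumed here; the abstract does not spell out the exact formula. The function names and the toy signals are hypothetical, and the SE frontend and SNRi prediction network themselves are not modeled.

```python
import math

def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(clean, estimate):
    """SNR in dB, treating (estimate - clean) as the residual noise."""
    residual = [e - c for e, c in zip(estimate, clean)]
    return 10.0 * math.log10(power(clean) / power(residual))

def snr_improvement_db(clean, noisy, enhanced):
    """SNRi (assumed definition): output SNR minus input SNR."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

# Toy signals (hypothetical values, for illustration only).
clean = [math.sin(0.1 * n) for n in range(1000)]
noise = [0.5 * math.sin(1.7 * n + 0.3) for n in range(1000)]
noisy = [c + v for c, v in zip(clean, noise)]

# Stand-in "enhancer" that leaves half of the noise amplitude in place.
# Halving the noise amplitude quarters its power, so SNRi is about 6.02 dB.
enhanced = [c + 0.5 * v for c, v in zip(clean, noise)]

snri = snr_improvement_db(clean, noisy, enhanced)
print(f"SNRi = {snri:.2f} dB")
```

In the setting the abstract describes, a scalar of this kind would be the target passed to the SE frontend, so a smaller target corresponds to gentler noise reduction and, presumably, fewer artifacts.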