Automatic speech recognition (ASR) systems degrade significantly under noisy conditions. Recently, speech enhancement (SE) has been introduced as a front-end module to reduce noise and improve speech quality for ASR, but it can also suppress some important speech information, i.e., the over-suppression problem. To alleviate this, we propose a dual-path style learning approach for end-to-end noise-robust automatic speech recognition (DPSL-ASR). Specifically, we first introduce the clean speech feature along with the fused feature from the previously proposed IFF-Net as dual-path inputs to recover the over-suppressed information. Then, we propose a style learning method that maps the fused feature close to the clean feature, in order to learn latent speech information from the latter, i.e., the clean "speech style". Furthermore, we employ a consistency loss to minimize the distance between the ASR outputs of the two paths to improve noise robustness. Experimental results show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline on the RATS Channel-A and CHiME-4 1-Channel Track datasets, respectively. Visualizations of intermediate embeddings indicate that DPSL-ASR can recover abundant over-suppressed information in enhanced speech. Our code is available at GitHub: https://github.com/YUCHEN005/DPSL-ASR.
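To make the training objective concrete, the following is a minimal PyTorch-style sketch of how the dual-path objectives described above could be combined. It assumes an MSE loss for the style learning term, a KL-divergence loss for the output consistency term, and CTC for the per-path ASR losses; the function name `dpsl_losses`, the tensor shapes, the stop-gradient (`detach`) choices, and the weights `lambda_style` / `lambda_consist` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def dpsl_losses(enc_clean, enc_fused, logprobs_clean, logprobs_fused,
                targets, input_lengths, target_lengths,
                lambda_style=1.0, lambda_consist=1.0):
    """Sketch of the combined dual-path training objective (assumed forms).

    enc_clean / enc_fused : encoder features of the clean-speech path and the
        IFF-Net fused-feature path, shape (B, T, D).
    logprobs_clean / logprobs_fused : per-path ASR output log-probabilities,
        shape (T, B, V) as expected by CTC.
    """
    # Style learning term: pull the fused feature toward the clean feature
    # so it recovers over-suppressed speech information (MSE is an assumption).
    style_loss = F.mse_loss(enc_fused, enc_clean.detach())

    # Consistency term: keep the ASR outputs of the two paths close
    # (KL divergence and the stop-gradient on the clean path are assumptions).
    consist_loss = F.kl_div(logprobs_fused, logprobs_clean.detach().exp(),
                            reduction="batchmean")

    # Standard ASR loss on each path (CTC assumed here for brevity).
    ctc_fused = F.ctc_loss(logprobs_fused, targets, input_lengths, target_lengths)
    ctc_clean = F.ctc_loss(logprobs_clean, targets, input_lengths, target_lengths)

    return (ctc_fused + ctc_clean
            + lambda_style * style_loss
            + lambda_consist * consist_loss)
```

In this sketch both paths share the ASR back-end, and only the relative weighting of the auxiliary terms would need tuning; the actual loss definitions and weighting follow the implementation in the linked repository.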