Speech enhancement (SE) is usually employed as a front end to improve speech quality in noisy environments, yet the enhanced speech may not be optimal for automatic speech recognition (ASR) systems due to speech distortion. On the other hand, it has been shown that self-supervised pre-training enables the utilization of large amounts of unlabeled noisy data, which benefits the noise robustness of ASR. However, the potential of an (optimal) integration of SE and self-supervised pre-training remains unclear. In order to find an appropriate combination and reduce the impact of the speech distortion caused by SE, in this paper we propose a joint pre-training approach for the SE module and the self-supervised model. First, in the pre-training phase either the original noisy waveform or the waveform produced by SE is fed into the self-supervised model to learn a contextual representation, where the quantized clean speech serves as the target. Second, we propose a dual-attention fusion method to fuse the features of the noisy and enhanced speech, which compensates for the information loss caused by using either module alone. Thanks to the flexible exploitation of the clean/noisy/enhanced branches, the proposed method generalizes several existing noise-robust ASR models, e.g., enhanced wav2vec2.0. Finally, experimental results on both synthetic and real noisy datasets show that the proposed joint training approach improves ASR performance under various noise conditions, leading to stronger noise robustness.
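To make the dual-attention fusion idea concrete, the following is a minimal sketch, assuming PyTorch, frame-level features from the noisy and SE-enhanced branches, and an attention form where per-frame softmax weights blend the two streams. All names (e.g., `DualAttentionFusion`, `d_model`) and layer choices are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DualAttentionFusion(nn.Module):
    """Fuse noisy and enhanced feature streams with learned per-frame weights (sketch)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # One scoring head per branch; a softmax over the two scores
        # yields frame-wise fusion weights.
        self.score_noisy = nn.Linear(d_model, 1)
        self.score_enh = nn.Linear(d_model, 1)

    def forward(self, feat_noisy: torch.Tensor, feat_enh: torch.Tensor) -> torch.Tensor:
        # feat_noisy, feat_enh: (batch, time, d_model)
        scores = torch.cat(
            [self.score_noisy(feat_noisy), self.score_enh(feat_enh)], dim=-1
        )  # (batch, time, 2)
        weights = torch.softmax(scores, dim=-1)
        # Frame-wise weighted sum of the two branches.
        return weights[..., 0:1] * feat_noisy + weights[..., 1:2] * feat_enh


if __name__ == "__main__":
    fusion = DualAttentionFusion(d_model=512)
    noisy = torch.randn(2, 100, 512)     # features of the noisy waveform
    enhanced = torch.randn(2, 100, 512)  # features of the SE-enhanced waveform
    print(fusion(noisy, enhanced).shape)  # torch.Size([2, 100, 512])
```

Such a fusion module would sit between the SE front end and the self-supervised encoder, letting the model fall back on the noisy branch wherever SE introduces distortion.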