Speech enhancement aims to improve the perceptual quality of a speech signal by suppressing background noise. However, excessive suppression may cause speech distortion and loss of speaker information, which degrades speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction. The framework is optimized using feedback from the speaker identification task together with the high-level perceptual deviation between the raw speech signal and its noisy version. We evaluate the system on speaker verification tasks in both noisy and clean environments. Compared to the baseline, our method performs better in both conditions, which indicates that it not only enhances speaker-related information but also avoids introducing additional distortion.
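The two training signals named in the abstract can be illustrated with a minimal sketch. It assumes a softmax cross-entropy for the speaker identification feedback and a mean squared distance between high-level feature vectors for the perceptual deviation. The function names, the squared-distance form, and the `weight` parameter are illustrative assumptions, not details taken from PL-EESR.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one utterance (speaker identification feedback)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def perceptual_deviation(feats_raw, feats_noisy):
    """Mean squared distance between two high-level feature vectors.

    Assumption: the features come from some fixed feature extractor applied
    to the raw signal and to the signal derived from its noisy version.
    """
    return sum((a - b) ** 2 for a, b in zip(feats_raw, feats_noisy)) / len(feats_raw)


def joint_loss(speaker_logits, speaker_id, feats_raw, feats_noisy, weight=1.0):
    """Hypothetical combined objective: identification loss plus weighted deviation."""
    return cross_entropy(speaker_logits, speaker_id) + weight * perceptual_deviation(
        feats_raw, feats_noisy
    )
```

In this sketch, a larger `weight` pushes the model toward matching the high-level features of the raw signal, while a smaller one favors speaker discrimination; how the two terms are actually balanced is not stated in the abstract.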