Self-supervised learning approaches for end-to-end speech encoding have attracted surging interest in recent years owing to their great success. In particular, WavLM has shown state-of-the-art performance on various speech processing tasks. To better understand the efficacy of self-supervised learning models for speech enhancement, in this work we design and conduct a series of experiments under three resource conditions, combining WavLM with two high-quality speech enhancement systems. We also propose a regression-based WavLM training objective and a noise-mixing data configuration to further boost downstream enhancement performance. Experiments on the DNS challenge dataset and a simulation dataset show that WavLM benefits the speech enhancement task in terms of both speech quality and speech recognition accuracy, especially under low fine-tuning resources. Under the high fine-tuning resource condition, only the word error rate is substantially improved.