Self-supervised learning (SSL) has achieved great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. To tackle this problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. In this way, WavLM not only keeps the speech content modeling capability through masked speech prediction, but also improves its potential for non-ASR tasks through speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech, and scales up the training data from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.
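To make the denoising objective concrete, the sketch below illustrates one way the noisy/overlapped inputs described above could be simulated: a segment of another utterance or of noise is overlaid onto part of the clean waveform, while the pre-training target remains the pseudo-label of the original speech. This is a minimal illustration, not the authors' implementation; the function name, the maximum mixing ratio, and the [-5, 5] dB energy-ratio range are assumptions chosen for the example.

```python
# Minimal sketch (assumed, not WavLM's actual code) of overlaying an
# interfering utterance or noise onto part of a clean waveform to create
# the noisy inputs used for the denoising objective.
import torch


def mix_for_denoising(speech: torch.Tensor,
                      interference: torch.Tensor,
                      max_mix_ratio: float = 0.5) -> torch.Tensor:
    """Overlay `interference` (another utterance or noise) onto a random
    sub-segment of `speech`. Both inputs are 1-D waveforms."""
    # Length of the overlaid segment: a random fraction of the clean speech.
    mix_len = int(speech.numel() * torch.rand(1).item() * max_mix_ratio)
    if mix_len == 0 or interference.numel() < mix_len:
        return speech.clone()

    # Pick random start positions in both waveforms.
    s_start = torch.randint(0, speech.numel() - mix_len + 1, (1,)).item()
    i_start = torch.randint(0, interference.numel() - mix_len + 1, (1,)).item()
    chunk = interference[i_start:i_start + mix_len]

    # Scale the interference to a random energy ratio in [-5, 5] dB
    # (the range here is an assumption for illustration).
    snr_db = torch.empty(1).uniform_(-5.0, 5.0)
    speech_power = speech.pow(2).mean().clamp_min(1e-8)
    chunk_power = chunk.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(speech_power / (chunk_power * 10 ** (snr_db / 10)))

    noisy = speech.clone()
    noisy[s_start:s_start + mix_len] += scale * chunk
    return noisy
```

During pre-training, the model would receive the mixed waveform as input while the masked-prediction targets are still derived from the clean `speech`, which is what encourages the learned representations to be robust to interfering speakers and background noise.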