Self-supervised learning (SSL) has achieved great success in speech recognition, while other speech processing tasks remain relatively underexplored. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created in an unsupervised manner and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pretrained models are available at https://aka.ms/wavlm.
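As a rough illustration of the utterance mixing idea, the sketch below overlays a random chunk of a secondary utterance onto a primary one at a random energy ratio, keeping the overlap short so the pre-training targets can remain those of the primary speaker. The function name, the `max_mix_ratio` and `snr_db_range` parameters, and the chunk/SNR sampling scheme are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def mix_utterances(primary, secondary, max_mix_ratio=0.5,
                   snr_db_range=(-5.0, 5.0), rng=None):
    """Overlay a random chunk of `secondary` onto `primary` (a sketch).

    The chunk covers at most `max_mix_ratio` of the primary utterance;
    both parameter values here are assumptions for illustration only.
    """
    rng = rng or np.random.default_rng()
    mixed = primary.copy()

    # Choose how much of the primary utterance the overlap may cover.
    mix_len = int(rng.uniform(0, max_mix_ratio) * len(primary))
    if mix_len == 0 or len(secondary) == 0:
        return mixed

    # Pick a random chunk from the secondary utterance and a random
    # start position within the primary one.
    chunk_start = rng.integers(0, max(1, len(secondary) - mix_len + 1))
    chunk = secondary[chunk_start:chunk_start + mix_len]
    start = rng.integers(0, len(primary) - len(chunk) + 1)

    # Scale the chunk to a random energy ratio relative to the primary,
    # so the primary speaker stays dominant on average.
    snr_db = rng.uniform(*snr_db_range)
    p_energy = np.mean(primary ** 2) + 1e-8
    c_energy = np.mean(chunk ** 2) + 1e-8
    scale = np.sqrt(p_energy / (c_energy * 10 ** (snr_db / 10)))

    mixed[start:start + len(chunk)] += scale * chunk
    return mixed
```

Usage would look like `mixed = mix_utterances(wav_a, wav_b)` on two waveform arrays from different speakers in the same batch; because only the primary utterance's labels are kept, the model must learn to attend to the dominant speaker, which is what drives the improved speaker discrimination.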