Self-supervised learning (SSL) has achieved great success in speech recognition, while other speech processing tasks have received limited exploration. As speech signals contain multi-faceted information including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created in an unsupervised manner and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours of public audio data and optimize the training procedure for better representation extraction. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.
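To make the utterance mixing strategy concrete, the following is a minimal sketch of the idea: for a fraction of the training utterances, a segment from another utterance in the same batch is scaled to a random signal-to-noise ratio and added over a random overlap region, while the training targets remain those of the primary utterance. The function name, hyperparameters (mix_prob, max_overlap_ratio, snr_range), and sampling details are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def utterance_mixing(batch, mix_prob=0.2, max_overlap_ratio=0.5, snr_range=(-5.0, 5.0)):
    """Mix a random secondary utterance into each primary utterance.

    batch: (B, T) float tensor of raw waveforms.
    Returns the mixed batch; training targets stay those of the primary
    utterance, so the model learns to attend to the main speaker.
    Hyperparameter names and values are illustrative, not the paper's settings.
    """
    B, T = batch.shape
    mixed = batch.clone()
    for i in range(B):
        if torch.rand(1).item() > mix_prob:
            continue
        # Pick a different utterance from the same batch as the interfering speaker.
        j = torch.randint(B, (1,)).item()
        if j == i:
            continue
        # The overlap covers at most max_overlap_ratio of the primary utterance.
        overlap = int(torch.randint(1, int(T * max_overlap_ratio) + 1, (1,)).item())
        start = torch.randint(0, T - overlap + 1, (1,)).item()
        src_start = torch.randint(0, T - overlap + 1, (1,)).item()
        segment = batch[j, src_start:src_start + overlap]
        # Scale the interfering segment to a random signal-to-noise ratio.
        snr = torch.empty(1).uniform_(*snr_range).item()
        p_primary = batch[i, start:start + overlap].pow(2).mean().clamp_min(1e-8)
        p_noise = segment.pow(2).mean().clamp_min(1e-8)
        scale = (p_primary / (p_noise * 10 ** (snr / 10))).sqrt()
        mixed[i, start:start + overlap] += scale * segment
    return mixed
```

Because the overlapped speech is synthesized on the fly from unlabeled audio, no extra annotation is required, which is what makes the strategy unsupervised.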