Motivated by the success of masked language modeling~(MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task that consumes the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks~(the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light~60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec~2.0 and HuBERT, our model shows~5\% to~10\% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec~2.0 by a relative margin of more than~30\%.
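As a rough sketch of the end-to-end training described above (the weighting coefficients $\beta$ and $\gamma$ below are illustrative placeholders, not values stated in this abstract), the two self-supervised objectives can be combined additively and optimized jointly:
\begin{equation*}
    \mathcal{L} \;=\; \beta \, \mathcal{L}_{\text{contrastive}} \;+\; \gamma \, \mathcal{L}_{\text{MLM}},
\end{equation*}
where $\mathcal{L}_{\text{contrastive}}$ drives the quantizer to produce a finite set of discriminative speech tokens from the continuous input, and $\mathcal{L}_{\text{MLM}}$ is the masked-prediction loss computed over those discretized tokens.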