Self-supervised speech pre-training equips a model with the contextual structure inherent in the speech signal, while self-supervised text pre-training equips it with linguistic knowledge. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts: a speech encoder, a text encoder, and a shared encoder. The model takes unlabeled speech and text data as input and applies the standard HuBERT and MLM losses, respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of the speech and text information. Specifically, to address the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes with alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the LibriSpeech dataset show that the proposed TESSP model achieves more than 10% improvement over WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing that it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition, and Speech Translation.
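To make the two mechanisms above concrete, the following is a minimal PyTorch sketch of phoneme up-sampling and representation swapping. The function names (`upsample_phonemes`, `swap_representations`), the frame-level duration format, and the random swap mask are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch

def upsample_phonemes(phoneme_ids: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Phoneme up-sampling: repeat each phoneme id by its aligned duration
    in frames, so the text sequence matches the speech frame rate.
    `durations` would come from forced alignments on a small supervised set."""
    return torch.repeat_interleave(phoneme_ids, durations)

def swap_representations(speech_repr: torch.Tensor,
                         text_repr: torch.Tensor,
                         swap_mask: torch.Tensor) -> torch.Tensor:
    """Representation swapping: at frames selected by `swap_mask`, replace the
    speech-encoder output with the (up-sampled) text-encoder output before
    feeding the result to the shared encoder."""
    return torch.where(swap_mask.unsqueeze(-1), text_repr, speech_repr)

# Toy usage: 3 phonemes aligned to 2, 3, and 1 frames -> 6 frames total.
phonemes = torch.tensor([7, 12, 3])
durations = torch.tensor([2, 3, 1])
frame_phonemes = upsample_phonemes(phonemes, durations)  # shape (6,)

T, D = frame_phonemes.numel(), 8
speech_repr = torch.randn(T, D)  # stand-in for speech-encoder output
text_repr = torch.randn(T, D)    # stand-in for up-sampled text-encoder output
swap_mask = torch.rand(T) < 0.5  # hypothetical: randomly pick frames to swap
mixed = swap_representations(speech_repr, text_repr, swap_mask)
```

The up-sampling step resolves the length mismatch, and the swap exposes the shared encoder to both modalities at aligned positions, which is what encourages the speech and text representation spaces to converge.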