TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Self-supervised speech pre-training equips a model with the contextual structure inherent in the speech signal, while self-supervised text pre-training equips it with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, their distinct pre-training objectives make it challenging to jointly optimize speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts: a speech encoder, a text encoder, and a shared encoder. The model takes unsupervised speech and text data as input and applies the common HuBERT loss to the speech and the MLM loss to the text. We also propose phoneme up-sampling and representation swapping to enable joint modeling of the speech and text information. Specifically, to fix the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes using alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders, according to the alignment information. Experiments on the LibriSpeech dataset show that the proposed TESSP model achieves more than 10% improvement over WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, where it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition, and Speech Translation.
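
The two alignment-based operations described above, phoneme up-sampling and representation swapping, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-phoneme frame durations, the swap probability, and the choice to replace text frames with speech frames (rather than the reverse) are all assumptions made for illustration, and representations are shown as plain Python lists rather than encoder outputs.

```python
import random
from typing import List, Sequence, Tuple


def upsample_phonemes(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme by its frame duration so text length matches speech length.

    `durations` is a hypothetical per-phoneme frame count; the abstract only says
    such alignment information comes from a small set of supervised data.
    """
    if len(phonemes) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def phoneme_spans(durations: Sequence[int]) -> List[Tuple[int, int]]:
    """Convert per-phoneme durations into [start, end) frame spans."""
    spans: List[Tuple[int, int]] = []
    start = 0
    for n_frames in durations:
        spans.append((start, start + n_frames))
        start += n_frames
    return spans


def swap_representations(
    speech_repr: Sequence,
    text_repr: Sequence,
    spans: Sequence[Tuple[int, int]],
    swap_prob: float,
    rng: random.Random,
) -> list:
    """Replace text frames with speech frames on randomly chosen phoneme spans.

    Assumption: spans are selected independently with probability `swap_prob`.
    """
    if len(speech_repr) != len(text_repr):
        raise ValueError("representations must be frame-aligned")
    mixed = list(text_repr)
    for start, end in spans:
        if rng.random() < swap_prob:
            mixed[start:end] = speech_repr[start:end]
    return mixed


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]
    durations = [2, 3, 1, 4]  # hypothetical frame counts per phoneme
    print(upsample_phonemes(phonemes, durations))

    n_frames = sum(durations)
    speech = [f"s{i}" for i in range(n_frames)]  # stand-ins for speech encoder outputs
    text = [f"t{i}" for i in range(n_frames)]    # stand-ins for text encoder outputs
    print(swap_representations(speech, text, phoneme_spans(durations), 0.5, random.Random(0)))
```

In this sketch the mixed sequence is what would be passed on to the shared encoder, so that it sees speech and text frames of the same phoneme in the same context.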

