Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

In this paper, we investigate the semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, in which the multispeaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs the text from the synthesized speech, after which both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and after training it becomes overly easy for the ASR model to recognize. This not only decreases the TTS model quality but also limits the ASR model improvement. To solve this problem, we propose improving the cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model, and only afterwards train both models, to avoid over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
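The two proposed ingredients can be illustrated with a minimal, dependency-free sketch. The abstract does not give the exact form of the speaker consistency loss, so the version below assumes a cosine distance between speaker embeddings of the reference and synthesized speech. The names `speaker_consistency_loss`, `trainable_models`, `total_loss`, `freeze_steps`, and `spk_weight` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(ref_emb: Sequence[float],
                             syn_emb: Sequence[float]) -> float:
    """Assumed form of the loss: 1 - cosine similarity.

    It is 0 when the synthesized speech's speaker embedding points in the
    same direction as the reference embedding, and grows as they diverge.
    """
    return 1.0 - cosine_similarity(ref_emb, syn_emb)


def trainable_models(step: int, freeze_steps: int) -> Set[str]:
    """Step-wise schedule: the TTS model is frozen for the first
    `freeze_steps` updates (only ASR is trained), then both are trained."""
    if step < freeze_steps:
        return {"asr"}
    return {"asr", "tts"}


def total_loss(cycle_loss: float, spk_loss: float,
               spk_weight: float = 1.0) -> float:
    """Cycle-consistency loss plus a weighted speaker consistency term.
    The weighting scheme is an assumption made for illustration."""
    return cycle_loss + spk_weight * spk_loss
```

In a real training loop the embeddings would come from a speaker encoder and the schedule would decide which model's parameters receive gradient updates; both of those components are outside the scope of this sketch.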


Related content

Speech synthesis


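Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, language teaching, rehabilitation therapy, and related areas. Speech synthesis can be said to be influencing every aspect of people's lives.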
