In this paper, we investigate the semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, where the multispeaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs the text from the synthesized speech, after which both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and it becomes overly easy for the ASR model to recognize after training. This not only degrades the TTS model quality but also limits the ASR model improvement. To solve this problem, we propose improving the cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model before both models are trained, avoiding over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
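The following is a minimal PyTorch sketch of one training step combining the cycle-consistency loss, the speaker consistency loss, and the TTS freezing described above. The `TTS`, `ASR`, and `SpeakerEncoder` modules here are hypothetical linear stand-ins, not the paper's actual architectures (real systems would use sequence-to-sequence models); the loss weighting and the cosine-based speaker loss are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MEL, SPK = 32, 80, 64  # toy vocabulary, mel, and speaker-embedding sizes

class TTS(nn.Module):  # stand-in multispeaker TTS: text + speaker embedding -> mel
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VOCAB + SPK, MEL)
    def forward(self, text_onehot, spk_emb):
        # broadcast the utterance-level speaker embedding over time
        spk = spk_emb.unsqueeze(1).expand(-1, text_onehot.size(1), -1)
        return self.proj(torch.cat([text_onehot, spk], dim=-1))

class ASR(nn.Module):  # stand-in ASR: mel -> per-frame token logits
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(MEL, VOCAB)
    def forward(self, mel):
        return self.proj(mel)

class SpeakerEncoder(nn.Module):  # stand-in speaker encoder: mel -> embedding
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(MEL, SPK)
    def forward(self, mel):
        return self.proj(mel.mean(dim=1))

tts, asr, spk_enc = TTS(), ASR(), SpeakerEncoder()

def cycle_step(text_ids, ref_mel, freeze_tts: bool):
    # Step-wise optimization: in the first stage the TTS parameters are frozen,
    # so the synthesized speech cannot over-adapt to what the ASR finds easy.
    for p in tts.parameters():
        p.requires_grad_(not freeze_tts)

    text_onehot = F.one_hot(text_ids, VOCAB).float()
    ref_spk = spk_enc(ref_mel)             # speaker embedding of the reference speech
    synth_mel = tts(text_onehot, ref_spk)  # TTS: text -> synthesized speech

    # Cycle-consistency loss: the ASR model must reconstruct the input text
    # from the synthesized speech.
    logits = asr(synth_mel)
    cycle_loss = F.cross_entropy(logits.transpose(1, 2), text_ids)

    # Speaker consistency loss: pull the speaker characteristics of the
    # synthesized speech toward those of the reference speech.
    synth_spk = spk_enc(synth_mel)
    spk_loss = 1.0 - F.cosine_similarity(synth_spk, ref_spk, dim=-1).mean()

    return cycle_loss + spk_loss

# toy batch: 2 utterances of 10 tokens, reference speech of 50 mel frames
loss = cycle_step(torch.randint(0, VOCAB, (2, 10)),
                  torch.randn(2, 50, MEL), freeze_tts=True)
loss.backward()
```

With `freeze_tts=True`, gradients update only the ASR model (and the speaker encoder in this sketch); a later stage would pass `freeze_tts=False` to train both models jointly.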