Several recently proposed text-to-speech (TTS) models have achieved human-level quality in both single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice from a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), remains a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem that arises when generating speech for a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates an additional speech sample of a query speaker drawn from external untranscribed datasets at each training iteration. The model then learns, through an adversarial learning scheme, to consistently generate speech of the same speaker as indicated by the corresponding speaker embedding vector. Experimental results show that the proposed method is effective compared to the baseline in terms of quality and speaker similarity in ZSM-TTS.
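To make the training procedure described above more concrete, the following is a minimal sketch of one ASCL-style training step, assuming a PyTorch setup. All module names (tts_model, speaker_encoder, discriminator), the batch layout, and the exact loss terms are illustrative assumptions rather than the paper's actual implementation; the sketch only shows the general pattern of pairing a query-speaker generation from untranscribed audio with a speaker-conditioned adversarial objective.

```python
# Hypothetical ASCL training step (not the authors' code).
import torch
import torch.nn.functional as F

def ascl_step(tts_model, speaker_encoder, discriminator,
              text, mel_target, spk_wav, query_wav,
              opt_g, opt_d):
    # Speaker embeddings for the paired (transcribed) speaker and for a
    # query speaker sampled from an external untranscribed corpus.
    spk_emb = speaker_encoder(spk_wav)        # [B, D]
    query_emb = speaker_encoder(query_wav)    # [B, D]

    # Supervised TTS path on the paired data.
    mel_pred = tts_model(text, spk_emb)
    recon_loss = F.l1_loss(mel_pred, mel_target)

    # Additional generation conditioned on the query speaker embedding;
    # no transcription or target mel exists for this speaker.
    mel_query = tts_model(text, query_emb)

    # Discriminator update: real = (target mel, its speaker embedding),
    # fake = (query-speaker generation, query embedding).
    d_real = discriminator(mel_target, spk_emb)
    d_fake = discriminator(mel_query.detach(), query_emb)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: fool the discriminator so the query-speaker output
    # stays consistent with its conditioning speaker embedding.
    g_adv = discriminator(mel_query, query_emb)
    g_loss = recon_loss + F.binary_cross_entropy_with_logits(
        g_adv, torch.ones_like(g_adv))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return recon_loss.item(), d_loss.item(), g_loss.item()
```

The key design point illustrated here is that the discriminator scores a (mel, speaker embedding) pair rather than the mel alone, so the adversarial signal penalizes generations whose speaker identity drifts away from the conditioning embedding.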