Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high-quality speech data. Prior work on voice cloning addresses this challenge by adapting pre-trained multi-speaker TTS models to a new voice, using a few minutes of speech from the new speaker. However, publicly available large multi-speaker datasets are often noisy, resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high-quality single-speaker TTS models to a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield performance comparable to a model trained from scratch on more than 27 hours of data, for both male and female target speakers.
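The recipe summarized above (start from a pretrained single-speaker model and adapt it on a few minutes of new-speaker data, keeping shared knowledge frozen) can be sketched with a toy stand-in model. All names, the toy linear "model", and the squared-error loss are illustrative assumptions, not the paper's actual architecture or training objective:

```python
# Toy sketch of transfer learning for speaker adaptation.
# A real TTS model would be a neural network trained on a
# spectrogram loss; here a linear map y = w*x + b stands in.

def finetune(params, frozen, data, lr=0.1, epochs=50):
    """Adapt only the non-frozen parameters to new-speaker data
    via gradient descent on squared error (a toy stand-in for a
    TTS reconstruction loss)."""
    w = dict(params)
    for _ in range(epochs):
        for x, y in data:
            pred = w["w"] * x + w["b"]
            err = pred - y  # gradient of 0.5*err**2 w.r.t. pred
            if "w" not in frozen:
                w["w"] -= lr * err * x
            if "b" not in frozen:
                w["b"] -= lr * err
    return w

# "Pretrained single-speaker" parameters.
pretrained = {"w": 2.0, "b": 0.0}

# A few "minutes" of new-speaker data following y = 2x + 1:
# the shared mapping (w) is unchanged, only the offset (b) differs.
new_speaker = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# Freeze w (shared acoustic knowledge); adapt only b to the new voice.
adapted = finetune(pretrained, frozen={"w"}, data=new_speaker)
```

With the shared parameter frozen, the tiny adaptation set is enough to recover the new speaker's offset (`b` converges to 1.0), mirroring why a few minutes of data can suffice when most of the model is reused.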