Cloning a speaker's voice from an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features that can be finetuned on individual samples within seconds. As our objective evaluation and human study show, it is possible to clone the voice of a speaker and the prosody of a spoken reference independently, with no degradation in quality and with high similarity to both the original voice and prosody. All of our code and trained models are available, alongside static and interactive demos.