Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.
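To make the conditioning scheme concrete, below is a minimal PyTorch sketch of a synthesizer that takes a speaker encoding, a per-frame pitch contour, and a set of latent style token weights as explicit inputs alongside the text. All module names, dimensions, and the pooling and decoding choices here are illustrative assumptions for exposition, not the paper's actual architecture, which is a full sequence-to-sequence TTS model.

```python
import torch
import torch.nn as nn

class ConditionedSynthesizer(nn.Module):
    """Toy synthesizer conditioned on speaker, pitch, and style inputs.

    Hypothetical sketch: every layer and dimension below is an assumption
    made for illustration, not the model proposed in the paper.
    """

    def __init__(self, n_symbols=128, d_model=256, n_style_tokens=10):
        super().__init__()
        self.text_embedding = nn.Embedding(n_symbols, d_model)
        # Learned bank of latent style tokens (GST-style).
        self.style_tokens = nn.Parameter(torch.randn(n_style_tokens, d_model))
        self.speaker_proj = nn.Linear(256, d_model)  # projects a speaker encoding
        self.pitch_proj = nn.Linear(1, d_model)      # per-frame pitch contour value
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, 80)         # 80-bin mel spectrogram frames

    def forward(self, text_ids, speaker_encoding, pitch_contour, style_weights):
        # text_ids: (B, T_text), speaker_encoding: (B, 256),
        # pitch_contour: (B, T_frames, 1), style_weights: (B, n_style_tokens)
        text = self.text_embedding(text_ids).mean(dim=1, keepdim=True)  # crude pooling
        style = style_weights @ self.style_tokens                       # (B, d_model)
        cond = (text
                + self.speaker_proj(speaker_encoding).unsqueeze(1)
                + style.unsqueeze(1))
        # Broadcast the global conditioning over the pitch frames, then decode.
        frames = cond + self.pitch_proj(pitch_contour)
        hidden, _ = self.decoder(frames)
        return self.to_mel(hidden)  # predicted mel spectrogram frames
```

At inference, each conditioning input can be varied independently (for example, swapping the speaker encoding while keeping the pitch contour and style weights fixed), which is the mechanism behind the fine-grained style control described above.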