This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at it. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.
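The core idea, learning a distribution over a speaker embedding space and then sampling from it to obtain novel speakers, can be sketched as follows. This is a minimal illustration only: the mixture-of-diagonal-Gaussians prior, the embedding dimensionality, and all parameter values here are assumptions for demonstration, not TacoSpawn's actual learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned prior over a 128-dim speaker embedding space:
# a small mixture of diagonal Gaussians (illustrative values only).
d, k = 128, 10
weights = np.full(k, 1.0 / k)      # mixture component weights
means = rng.normal(size=(k, d))    # component means
stddevs = np.full((k, d), 0.1)     # component standard deviations

def sample_speaker_embedding():
    """Draw a novel speaker embedding by sampling the mixture prior."""
    c = rng.choice(k, p=weights)             # pick a mixture component
    return rng.normal(means[c], stddevs[c])  # sample within that component

emb = sample_speaker_embedding()
```

Each draw yields a different point in embedding space, so conditioning the synthesizer on `emb` would produce speech in a voice that corresponds to no training speaker.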