We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. Vall-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
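A schematic way to state this formulation (using assumed notation, not the paper's exact equations): given a phoneme sequence $\mathbf{x}$ and the codec codes $\tilde{\mathbf{C}}$ extracted from the 3-second enrolled recording, the model autoregressively predicts the codec codes $\mathbf{C} = (c_1, \dots, c_T)$ of the target utterance,
$$p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta) \;=\; \prod_{t=1}^{T} p(c_t \mid c_{<t}, \mathbf{x}, \tilde{\mathbf{C}}; \theta),$$
after which the neural codec's decoder converts the predicted discrete codes back into a waveform.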