We propose VALL-E X, a cross-lingual neural codec language model for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multilingual conditional codec language model to predict the acoustic token sequence of the target-language speech, using both the source-language speech and the target-language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation. Experimental results show that, given only a single source-language utterance as a prompt, it can generate high-quality speech in the target language while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, and the accent can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}.