This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) that aims to preserve the target language's pronunciation regardless of the original speaker's language. The model is based on a non-attentive Tacotron architecture in which the decoder is replaced by a normalizing flow network conditioned on the speaker identity; because this inherently disentangles linguistic content from speaker identity, the same model can perform both TTS and voice conversion (VC). In the cross-lingual setting, acoustic features are first produced with a native speaker of the target language, and the same model then applies voice conversion to render these features in the target speaker's voice. Through objective and subjective evaluations, we verify that our method offers benefits over baseline cross-lingual synthesis. By including speakers with an average of 7.5 minutes of speech each, we also present positive results in low-resource scenarios.
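To make the two-stage inference concrete, the sketch below illustrates the idea in PyTorch-style Python: a normalizing flow conditioned on a speaker embedding, whose forward pass maps acoustic features to a speaker-independent latent and whose inverse renders that latent for a different speaker. This is a minimal sketch under simplifying assumptions (plain affine couplings, no text conditioning), not the paper's implementation; all names (`SpeakerConditionedFlow`, `to_latent`, `to_mel`) are hypothetical.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer conditioned on a speaker embedding."""
    def __init__(self, dim, spk_dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, spk, inverse=False):
        xa, xb = x[..., :self.half], x[..., self.half:]
        scale, shift = self.net(torch.cat([xa, spk], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)  # keep the transform well-conditioned
        if inverse:
            xb = (xb - shift) * torch.exp(-scale)
        else:
            xb = xb * torch.exp(scale) + shift
        return torch.cat([xa, xb], dim=-1)

class SpeakerConditionedFlow(nn.Module):
    """Stack of couplings: the forward pass strips speaker identity into a
    latent; the inverse pass re-renders the latent for a given speaker."""
    def __init__(self, dim=80, spk_dim=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            AffineCoupling(dim, spk_dim) for _ in range(n_layers))

    def to_latent(self, mel, spk):
        """Acoustic features -> (nominally) speaker-free latent."""
        z = mel
        for layer in self.layers:
            # flip feature halves between layers so every dim gets transformed
            z = torch.flip(layer(z, spk), dims=[-1])
        return z

    def to_mel(self, z, spk):
        """Inverse pass: latent -> acoustic features for speaker `spk`."""
        for layer in reversed(self.layers):
            z = layer(torch.flip(z, dims=[-1]), spk, inverse=True)
        return z

# Two-stage cross-lingual inference, as described above:
# 1) synthesize features with a native speaker of the target language,
# 2) convert them to the target speaker's voice with the same flow.
flow = SpeakerConditionedFlow()
mel_native = torch.randn(1, 50, 80)   # stand-in for the TTS output of step 1
spk_native = torch.randn(1, 50, 64)   # native-speaker embedding, per frame
spk_target = torch.randn(1, 50, 64)   # target-speaker embedding, per frame
z = flow.to_latent(mel_native, spk_native)   # strip the native speaker
mel_target = flow.to_mel(z, spk_target)      # re-render in the target voice
```

Because both directions use the same invertible network, TTS and VC share all parameters; swapping the speaker embedding between the forward and inverse passes is what performs the conversion.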