We aim to build a multilingual speech synthesis system that can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging because bilingual training data is expensive to obtain in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, yielding poor transfer capabilities. To overcome this, we present a multilingual, multi-accented, multi-speaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation shows that our model retains a speaker's voice and accent quality better than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.
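The key design idea stated above is disentangled, independently controllable factors: speaker, language, and accent are conditioned on separately, alongside fine-grained per-frame $F_0$ and energy. A minimal sketch of such factorized conditioning is shown below; it is not the paper's implementation, and the table sizes, embedding dimension, and function name are hypothetical, but it illustrates why separate lookup tables allow mixing any speaker with any accent at synthesis time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not from the paper.
N_SPEAKERS, N_LANGS, N_ACCENTS, DIM = 10, 3, 7, 16

# Independent lookup tables, one per factor, so each factor can be
# chosen freely at synthesis time (e.g. speaker A with accent B).
spk_table = rng.normal(size=(N_SPEAKERS, DIM))
lang_table = rng.normal(size=(N_LANGS, DIM))
acc_table = rng.normal(size=(N_ACCENTS, DIM))

def conditioning(speaker, language, accent, f0, energy):
    """Build per-frame conditioning: broadcast the global factor
    embeddings over T frames and append the fine-grained F0 and
    energy scalars for each frame."""
    T = len(f0)
    g = np.concatenate([spk_table[speaker],
                        lang_table[language],
                        acc_table[accent]])   # (3*DIM,)
    g = np.tile(g, (T, 1))                    # (T, 3*DIM)
    fine = np.stack([f0, energy], axis=1)     # (T, 2)
    return np.concatenate([g, fine], axis=1)  # (T, 3*DIM + 2)

# Any speaker/accent combination is valid, even ones never paired
# in the training data.
cond = conditioning(speaker=2, language=0, accent=5,
                    f0=np.linspace(100.0, 120.0, 50),
                    energy=np.ones(50))
print(cond.shape)  # (50, 50) since 3*16 + 2 = 50
```

Because no bilingual data ties a speaker to an accent, a real model would additionally need a training strategy (e.g. adversarial or data-augmentation based) to keep these factors from re-entangling; the sketch only shows the conditioning interface.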