In this work, we explore multiple architectures and training procedures for building a multi-speaker and multi-lingual neural TTS system, with the goals of (a) improving quality when the available data in the target language is limited and (b) enabling cross-lingual synthesis. We report results from a large experiment covering 30 speakers in 8 different languages across 15 different locales, with the system trained on the same amount of data per speaker. Compared to a single-speaker model, the proposed system, when fine-tuned to a speaker, produces significantly better quality in most cases while using less than $40\%$ of the data used to build the single-speaker model. In cross-lingual synthesis, the generated quality is, on average, within $80\%$ of that of native single-speaker models in terms of Mean Opinion Score.