In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because a multilingual corpus from a single speaker is difficult to obtain, training a multilingual TTS model on monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, and we also employ domain adversarial training, which is applied in other multilingual TTS models. Furthermore, with the speaker regularization loss added, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. SANE-TTS also maintains speaker similarity close to that of the ground truth, even in cross-lingual inference. Audio samples are available on our web page.
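As a rough illustration of the zero-vector replacement, the following minimal sketch uses a toy linear duration predictor rather than the actual SANE-TTS module. The function names (`duration_condition`, `predict_log_durations`), the weights, and the choice to apply the replacement only when a `cross_lingual` flag is set are all assumptions made for illustration. It shows only why the predicted rhythm becomes independent of the source speaker once the conditioning vector is zeroed.

```python
def duration_condition(speaker_embedding, cross_lingual):
    """Return the conditioning vector fed to the (toy) duration predictor.

    Assumption: the speaker embedding is swapped for a zero vector only
    in cross-lingual inference; otherwise it is passed through unchanged.
    """
    if cross_lingual:
        return [0.0] * len(speaker_embedding)
    return list(speaker_embedding)


def predict_log_durations(text_hidden, condition, w_text, w_spk, bias=0.0):
    """Toy linear duration predictor: one log-duration per input token.

    text_hidden: list of per-token feature vectors.
    condition:   speaker conditioning vector shared by all tokens.
    w_text, w_spk, bias: hypothetical weights standing in for a trained network.
    """
    spk_term = sum(w * s for w, s in zip(w_spk, condition))
    return [
        sum(w * h for w, h in zip(w_text, token)) + spk_term + bias
        for token in text_hidden
    ]


# Two different source speakers reading the same (toy) text encoding.
text_hidden = [[1.0, 2.0], [0.5, -1.0]]
w_text, w_spk = [0.3, 0.1], [0.7, -0.2]
speaker_a, speaker_b = [1.0, 2.0], [-3.0, 0.5]

cross_a = predict_log_durations(
    text_hidden, duration_condition(speaker_a, True), w_text, w_spk)
cross_b = predict_log_durations(
    text_hidden, duration_condition(speaker_b, True), w_text, w_spk)
intra_a = predict_log_durations(
    text_hidden, duration_condition(speaker_a, False), w_text, w_spk)
```

In this toy setting, `cross_a` and `cross_b` are identical because the speaker term vanishes, while `intra_a` still carries the speaker-dependent offset.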