Cross-lingual speech synthesis aims to synthesize speech in multiple languages for a monolingual speaker. Since typically only monolingual recordings are available for model training, the speaker similarity between the synthesized cross-lingual speech and the speaker's native-language recordings is relatively low. Building on a multilingual Transformer text-to-speech model, this paper studies a multi-task learning framework to improve cross-lingual speaker similarity. To further improve speaker similarity, joint training with a speaker classifier is proposed. A scheme similar to parallel scheduled sampling is introduced so that the Transformer can still be trained efficiently, without breaking its parallel training mechanism when joint training is added. In both subjective and objective evaluations, multi-task learning and speaker-classifier joint training consistently improve cross-lingual speaker similarity for speakers both seen and unseen during training.
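To make the joint-training idea concrete, the following is a minimal sketch of adding a speaker-classification loss on top of a TTS reconstruction loss. The `SpeakerClassifier` module, the `joint_loss` helper, the tensor shapes, and the weight `alpha` are all illustrative assumptions, not the paper's exact architecture, and the parallel-scheduled-sampling-like training scheme is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerClassifier(nn.Module):
    """Hypothetical classifier predicting speaker identity from mel-spectrograms."""

    def __init__(self, n_mels: int, n_speakers: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_speakers),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); average over time, then classify
        return self.net(mel.mean(dim=1))


def joint_loss(mel_pred, mel_target, spk_logits, spk_ids, alpha=0.1):
    """TTS reconstruction loss plus a weighted speaker-classification loss.

    The classifier is applied to the model's predicted mels so its gradient
    encourages the synthesized speech to retain the target speaker's identity.
    """
    recon = F.l1_loss(mel_pred, mel_target)
    spk = F.cross_entropy(spk_logits, spk_ids)
    return recon + alpha * spk


if __name__ == "__main__":
    # Toy shapes only: 4 utterances, 100 frames, 80 mel bins, 10 speakers.
    mel_pred = torch.randn(4, 100, 80, requires_grad=True)
    mel_target = torch.randn(4, 100, 80)
    spk_ids = torch.randint(0, 10, (4,))

    clf = SpeakerClassifier(n_mels=80, n_speakers=10)
    loss = joint_loss(mel_pred, mel_target, clf(mel_pred), spk_ids)
    loss.backward()
    print(float(loss))
```

In this sketch the classifier sees the predicted mels during training, which is why some form of parallel scheduled sampling is needed in practice: the model's own predictions must be obtained without falling back to slow autoregressive decoding.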