Building multispeaker neural text-to-speech (TTS) synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker, and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, multispeaker TTS models can be hard to train and yield poor speaker similarity and naturalness. To address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by appending an additional loss term, and augmenting the input data for each speaker using waveform manipulation methods. We show that both methods are effective when evaluated with objective and subjective measures: the additional loss term improves speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.
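The "additional loss term" idea can be sketched as a weighted auxiliary speaker-classification loss added to the main reconstruction loss. The function names, the cross-entropy formulation, and the weight `alpha` below are illustrative assumptions, not details taken from the paper:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target].

    Uses the max-shift trick for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def total_loss(recon_loss, speaker_logits, speaker_id, alpha=0.1):
    """Main TTS reconstruction loss plus a weighted speaker-identity term.

    `speaker_logits` would come from a classifier head attached to the
    learned speaker representation; `alpha` balances the two objectives
    (both hypothetical here).
    """
    return recon_loss + alpha * cross_entropy(speaker_logits, speaker_id)

# Toy usage: a confident, correct speaker prediction adds almost nothing
# to the main loss, while a wrong one is penalized.
loss_good = total_loss(1.0, [10.0, -10.0], speaker_id=0)
loss_bad = total_loss(1.0, [-10.0, 10.0], speaker_id=0)
```

In this sketch, gradients from the auxiliary term push the speaker representation toward being discriminative across speakers, which is one plausible mechanism for the improved speaker similarity reported in the abstract.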