In the development of neural text-to-speech (TTS) systems, pre-training the model with a large amount of non-target speakers' data is a common approach. However, the actual benefit of model pre-training to the final system performance for the target speaker(s) is uncertain and unstable, depending heavily on the quantity and text content of the training data. This study aims to better understand why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to the target speaker's data in terms of text content. Experiments show that leveraging a speaker-independent TTS model trained on speech data with diverse text content can improve the target-speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data required for a new text domain, so as to improve data and computational efficiency. It is found that the TTS system can achieve comparable performance even when the pre-training data is reduced to 1/8 of its original size.