We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
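To make the contrast between adaptation strategies (i) and (ii) concrete, the following is a minimal sketch, not the paper's actual WaveNet implementation: a toy speaker-conditioned model with a small feed-forward stack standing in for the shared conditional WaveNet core, and toy dimensions chosen for illustration. Strategy (i) optimizes only a fresh embedding for the new speaker while the core stays frozen; strategy (ii) fine-tunes the embedding together with the core via stochastic gradient descent.

```python
# Minimal sketch (illustrative only, not the paper's code) contrasting adaptation
# strategies (i) and (ii) on a toy speaker-conditioned model; "core" stands in
# for the shared conditional WaveNet stack.
import torch
import torch.nn as nn

class ToySpeakerConditionedTTS(nn.Module):
    def __init__(self, num_speakers, embed_dim=16, hidden=32):
        super().__init__()
        # Independent learned embedding per training speaker (used during multi-speaker training).
        self.speaker_embed = nn.Embedding(num_speakers, embed_dim)
        # Shared conditional core; a real system would use a conditional WaveNet here.
        self.core = nn.Sequential(
            nn.Linear(embed_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, speaker_vec):
        # x: (batch, 1) conditioning input; speaker_vec: (batch, embed_dim).
        # At adaptation time we bypass the embedding table and pass a new vector directly.
        return self.core(torch.cat([x, speaker_vec], dim=-1))

model = ToySpeakerConditionedTTS(num_speakers=10)

# Strategy (i): learn only a new speaker embedding; the core is kept fixed.
new_embedding = nn.Parameter(torch.zeros(1, 16))
opt_embedding_only = torch.optim.SGD([new_embedding], lr=1e-2)

# Strategy (ii): fine-tune the embedding and the entire core jointly with SGD.
opt_whole_model = torch.optim.SGD([new_embedding] + list(model.core.parameters()), lr=1e-3)

# One adaptation step on a small batch of (x, y) pairs from the new speaker (random toy data).
x = torch.randn(8, 1)
y = torch.randn(8, 1)
opt_embedding_only.zero_grad()
loss = nn.functional.mse_loss(model(x, new_embedding.expand(8, -1)), y)
loss.backward()
opt_embedding_only.step()   # use opt_whole_model.step() instead for strategy (ii)
```

Strategy (iii) would replace the gradient-based inner loop with a trained encoder network that maps a few seconds of the new speaker's audio directly to a predicted embedding vector.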