The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring the speaking style of an arbitrary source speaker to a target speaker's voice using only a very limited amount of neutral data. This is a very challenging task, since the learning algorithm has to handle few-shot voice cloning and speaker-prosody disentanglement at the same time. Accelerating the adaptation process for a new target speaker is important in real-world applications, but even more challenging. In this paper, we approach the hard task of fast few-shot style transfer for voice cloning using meta-learning. We investigate the model-agnostic meta-learning (MAML) algorithm and meta-transfer a pre-trained multi-speaker, multi-prosody base TTS model so that it becomes highly sensitive to adaptation with few samples. A domain adversarial training mechanism and an orthogonal constraint are adopted to disentangle speaker and prosody representations for effective cross-speaker style transfer. Experimental results show that the proposed approach can perform fast voice cloning using only 5 samples (around 12 seconds of speech) from a target speaker, with only 100 adaptation steps. Audio samples are available online.
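To make the meta-transfer idea concrete, the sketch below shows a minimal first-order MAML training loop in PyTorch. It is an illustration under simplifying assumptions, not the authors' implementation: the toy regressor stands in for the multi-speaker base TTS model, `sample_task` is a hypothetical stand-in for drawing a few-shot support/query split from one speaker, and the domain adversarial training and orthogonal constraint for speaker-prosody disentanglement are omitted.

```python
# Minimal first-order MAML sketch (illustration only; model, data, and
# hyper-parameters are placeholders, not the paper's actual setup).
import copy
import torch
import torch.nn as nn

# Toy regressor standing in for the pre-trained base TTS acoustic model.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 80))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()
inner_lr, inner_steps = 1e-3, 5  # few adaptation steps per task

def sample_task():
    """Hypothetical stand-in: draw a (support, query) split for one speaker."""
    x = torch.randn(8, 16)
    y = torch.randn(8, 80)
    return (x[:5], y[:5]), (x[5:], y[5:])  # 5 support samples, 3 query samples

for meta_step in range(100):
    meta_opt.zero_grad()
    (xs, ys), (xq, yq) = sample_task()

    # Inner loop: adapt a copy of the model on the few support samples.
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        loss_fn(fast(xs), ys).backward()
        inner_opt.step()

    # Outer loop (first-order approximation): evaluate the adapted copy on
    # query data and apply its gradients to the meta-parameters.
    loss_fn(fast(xq), yq).backward()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = fp.grad.clone()
    meta_opt.step()
```

After meta-training, cloning a new voice would correspond to running only the inner loop on that speaker's handful of samples, which is what makes adaptation with roughly 100 steps plausible.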