Personalized TTS is an exciting and highly desired application that allows users to train a TTS voice from only a few of their own recordings. However, TTS training typically requires many hours of recordings and a large model, making such models unsuitable for deployment on mobile devices. To overcome this limitation, related works typically fine-tune a pre-trained TTS model so that it adapts to the target speaker's voice while preserving its ability to generate high-quality audio samples; this process is commonly referred to as ``voice cloning.'' Although related works have achieved significant success in changing the TTS model's voice, they still must fine-tune from a large pre-trained model, so the resulting voice-cloned model remains large. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that with learnable structured pruning, we can reduce the model size by a factor of 7 while achieving comparable voice-cloning performance.
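The learnable-mask idea can be sketched roughly as follows. This is a toy, pure-Python illustration under assumptions of ours, not the paper's implementation: the `prune_rows` helper, the sigmoid relaxation of the mask, and the 0.5 keep-threshold are all hypothetical choices, used only to show how a per-speaker mask can shrink a layer by removing whole output channels (structured pruning) rather than individual weights.

```python
import math

def sigmoid(x):
    """Logistic function used as a soft, differentiable relaxation of a 0/1 mask."""
    return 1.0 / (1.0 + math.exp(-x))

def prune_rows(weight, mask_logits, threshold=0.5):
    """Structured pruning: drop whole output rows (channels) whose learned
    mask probability falls below `threshold`, yielding a smaller layer.
    (Hypothetical sketch; the paper's exact masking scheme may differ.)"""
    return [row for row, logit in zip(weight, mask_logits)
            if sigmoid(logit) >= threshold]

# Toy 4x3 weight matrix; suppose training on the target speaker's
# voice-cloning data drove two mask logits strongly negative
# (channels this speaker's voice does not need).
weight = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6],
          [0.7, 0.8, 0.9],
          [1.0, 1.1, 1.2]]
mask_logits = [3.0, -4.0, 2.5, -5.0]

pruned = prune_rows(weight, mask_logits)
print(len(pruned))  # 2 rows survive -> a smaller per-speaker model
```

Because the mask is learned per speaker, each target speaker ends up with a different pruned sub-network of the shared pre-trained model, which is how a unique compact model is produced for every user.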