Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-tuning most, if not all, of the parameters of a pretrained multi-speaker backbone model. However, serving hundreds of fine-tuned neural TTS models is expensive, as each requires a significant footprint and separate computational resources (e.g., accelerators, memory). To scale speaker-adapted neural TTS voices to hundreds of speakers while preserving naturalness and speaker similarity, this paper proposes a parameter-efficient few-shot speaker adaptation approach in which the backbone model is augmented with trainable lightweight modules called residual adapters. This architecture allows the backbone model to be shared across different target speakers. Experimental results show that the proposed approach achieves naturalness and speaker similarity competitive with full fine-tuning, while requiring only $\sim$0.1% of the backbone model parameters per speaker.
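A residual adapter of this kind is commonly realized as a small bottleneck network whose output is added back to the backbone's hidden states, so that only the bottleneck weights are trained per speaker. The following is a minimal NumPy sketch under illustrative assumptions (the hidden size, bottleneck width, ReLU nonlinearity, and zero-initialized up-projection are common choices in the adapter literature, not the paper's exact configuration):

```python
import numpy as np

def residual_adapter(h, W_down, W_up):
    """Apply a bottleneck adapter residually to hidden states h.

    h:      (time, d_model) hidden states from a frozen backbone layer
    W_down: (d_model, d_bottleneck) trainable down-projection
    W_up:   (d_bottleneck, d_model) trainable up-projection
    """
    z = np.maximum(0.0, h @ W_down)  # project down + ReLU
    return h + z @ W_up              # project up, add residual connection

# Illustrative sizes: a 512-dim backbone layer with an 8-dim bottleneck.
d_model, d_bottleneck = 512, 8
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(d_model, d_bottleneck))
# Zero-initializing W_up makes the adapter start as the identity,
# so training begins from the unmodified backbone.
W_up = np.zeros((d_bottleneck, d_model))

h = rng.normal(size=(4, d_model))
out = residual_adapter(h, W_down, W_up)
assert np.allclose(out, h)  # identity at initialization

# Per-speaker trainable parameters: just the two projections.
adapter_params = W_down.size + W_up.size
```

Because only `W_down` and `W_up` (here 2 x 512 x 8 = 8192 values per adapted layer) are stored per speaker, the large frozen backbone can be loaded once and shared across all speakers at serving time.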