Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech for multiple users from the few audio training samples stored locally on their devices. However, federated text-to-speech faces several challenges: very few training samples are available from each speaker, the training samples are all stored on each user's local device, and the global model is vulnerable to various attacks. In this paper, we propose FedSpeech, a novel federated learning architecture based on continual learning approaches, to overcome these difficulties. Specifically, 1) we use gradual pruning masks to isolate parameters that preserve speakers' tones; 2) we apply selective masks to effectively reuse knowledge from other tasks; 3) we introduce a private speaker embedding to protect users' privacy. Experiments on a reduced VCTK dataset demonstrate the effectiveness of FedSpeech: it nearly matches multi-task training in multi-speaker speech quality; moreover, it sufficiently retains the speakers' tones and even outperforms multi-task training in the speaker similarity experiment.
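For intuition, the sketch below illustrates how a parameter-isolating pruning mask and a selective reuse mask could be combined when computing a layer's output. The helper names, keep ratio, and random tensors are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch

def gradual_prune_mask(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the largest-magnitude `keep_ratio` fraction of weights for the
    current speaker; remaining weights stay free for later speakers
    (hypothetical helper, not the paper's code)."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def masked_forward(weight, x, own_mask, selective_mask):
    # Parameters isolated for this speaker, plus parameters selectively
    # reused from earlier speakers' tasks, contribute to the output.
    effective = weight * torch.clamp(own_mask + selective_mask, max=1.0)
    return x @ effective.t()

# Toy usage with random tensors (illustration only).
w = torch.randn(16, 8)
mask_a = gradual_prune_mask(w, keep_ratio=0.3)            # isolates speaker A's parameters
mask_reuse = torch.bernoulli(torch.full_like(w, 0.2))     # selective reuse of other tasks
y = masked_forward(w, torch.randn(4, 8), mask_a, mask_reuse)
```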