Training a multi-speaker Text-to-Speech (TTS) model from scratch is computationally expensive, and adding new speakers to the dataset requires re-training the model. The naive solution of sequentially fine-tuning a model on new speakers can degrade its performance on previously learned speakers, a phenomenon known as catastrophic forgetting. In this paper, we approach TTS modeling from a continual learning perspective, where the goal is to add new speakers without forgetting previous ones. We first propose an experimental setup and show that sequential fine-tuning on new speakers can result in forgetting of the previous speakers. We then exploit two well-known continual learning techniques, namely experience replay and weight regularization, and show how these methods can mitigate the degradation of speech synthesis diversity during sequential training on new speakers. Finally, we present a simple extension that improves the results in extreme setups.
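To make the two continual learning ingredients concrete, the following is a minimal PyTorch-style sketch of one sequential fine-tuning stage that combines experience replay (rehearsing a small buffer of past speakers' utterances) with weight regularization (an L2 penalty anchoring the parameters to the previous model). This is an illustrative sketch under assumed interfaces, not the paper's implementation: the model, the placeholder `tts_loss`, the `ReplayBuffer` class, and all coefficients are hypothetical stand-ins.

```python
# Sketch: sequential speaker fine-tuning with experience replay and an
# L2 weight-regularization anchor. All names and hyperparameters here are
# illustrative assumptions, not the method described in the paper.

import random
import torch
import torch.nn as nn


def tts_loss(model, batch):
    # Placeholder reconstruction loss; a real TTS system would compare
    # predicted and ground-truth mel-spectrograms (plus duration terms).
    text_features, mel_target = batch
    return nn.functional.mse_loss(model(text_features), mel_target)


class ReplayBuffer:
    """Keeps a bounded random sample of utterances from previously seen speakers."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self.items = []

    def add(self, batch):
        # Reservoir-style insertion keeps a roughly uniform sample over time.
        if len(self.items) < self.capacity:
            self.items.append(batch)
        else:
            self.items[random.randrange(len(self.items))] = batch

    def sample(self):
        return random.choice(self.items) if self.items else None


def finetune_on_new_speaker(model, new_speaker_batches, replay,
                            reg_lambda=1e-3, replay_weight=1.0, lr=1e-4):
    """One fine-tuning stage: new-speaker loss + replayed loss + L2 anchor."""
    old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    for batch in new_speaker_batches:
        loss = tts_loss(model, batch)                 # loss on the new speaker

        replayed = replay.sample()                    # experience replay term
        if replayed is not None:
            loss = loss + replay_weight * tts_loss(model, replayed)

        # Weight regularization: penalize drift from the previous model.
        reg = sum(((p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
        loss = loss + reg_lambda * reg

        opt.zero_grad()
        loss.backward()
        opt.step()

        replay.add(batch)                             # keep data for later stages
    return model


if __name__ == "__main__":
    # Toy usage: a linear "model" mapping text features to mel frames.
    model = nn.Linear(16, 80)
    replay = ReplayBuffer()
    batches = [(torch.randn(8, 16), torch.randn(8, 80)) for _ in range(10)]
    finetune_on_new_speaker(model, batches, replay)
```

A more faithful variant of the regularization term would weight each parameter by its estimated importance (as in Elastic Weight Consolidation) rather than using a uniform L2 penalty; the uniform penalty is used here only to keep the sketch short.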