This work presents a lifelong learning approach to training a multilingual Text-To-Speech (TTS) system, in which each language is treated as an individual task and learned sequentially and continually. The approach does not require data pooled from all languages, and thus alleviates the storage and computation burden. One of the central challenges of lifelong learning is "catastrophic forgetting": in the TTS scenario, it means that model performance degrades rapidly on previously seen languages as the model adapts to a new one. We address this problem with a data-replay-based lifelong learning method. We formulate the replay process as a supervised learning problem and propose a simple yet effective dual-sampler framework to handle the heavily language-imbalanced training samples. Through objective and subjective evaluations, we show that this supervised learning formulation outperforms other gradient-based and regularization-based lifelong learning methods, achieving a 43% Mel-Cepstral Distortion reduction compared to a fine-tuning baseline.
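To make the dual-sampler idea concrete, below is a minimal Python sketch of one plausible batch-composition scheme: one sampler draws from a small buffer of utterances retained from earlier languages, the other from the current language's full training set, so every batch mixes old and new tasks despite the size imbalance. The function name, the `(text, mel)` pair representation, and the fixed `replay_ratio` are illustrative assumptions, not the paper's exact method.

```python
import random

def make_dual_sampler(replay_buffer, new_lang_data, batch_size, replay_ratio=0.5):
    """Yield batches with a fixed fraction of replayed samples.

    replay_buffer  -- list of (text, mel) pairs kept from earlier languages
    new_lang_data  -- list of (text, mel) pairs for the language being learned
    replay_ratio   -- fraction of each batch drawn from the buffer (assumed 0.5)
    """
    n_replay = int(batch_size * replay_ratio)
    n_new = batch_size - n_replay
    while True:
        # Sampling the buffer at a fixed per-batch rate keeps earlier
        # languages represented even though the buffer is far smaller
        # than the new language's training set.
        batch = random.sample(replay_buffer, n_replay) \
              + random.sample(new_lang_data, n_new)
        random.shuffle(batch)
        yield batch
```

Because replayed pairs are treated as ordinary (input, target) supervision within each batch, this casts replay as a standard supervised learning problem rather than a gradient- or regularization-based constraint.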