While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language-agnostic meta-learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker using objective metrics as well as human studies, and we release our code and trained models as open source.