To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates the capability to adapt to new languages under extreme low-resource and even few-shot scenarios, with as little as 40 s of transcribed recordings. It does so without the need for per-language resources such as lexicons, extra corpora, auxiliary models, or linguistic expertise, thus ensuring scalability, while retaining satisfactory intelligibility and naturalness that match rich-resource models. Exhaustive comparative and ablation studies are performed to reveal the potential of the framework for low-resource languages. Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.
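To illustrate why byte inputs allow arbitrary scripts, the following minimal sketch encodes text as UTF-8 bytes, so that every language shares one fixed set of 256 input symbols. The function names and `BYTE_VOCAB_SIZE` are hypothetical and chosen for illustration only; the abstract does not specify how the framework itself prepares its inputs.

```python
# Illustrative sketch only: the names below are assumptions, not the paper's code.

BYTE_VOCAB_SIZE = 256  # every UTF-8 byte value falls in the range 0..255


def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids by decoding the byte sequence as UTF-8."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    samples = {"English": "hello", "Chinese": "你好", "Russian": "привет"}
    for language, sentence in samples.items():
        ids = text_to_byte_ids(sentence)
        # No per-language symbol table is needed: all IDs share one vocabulary.
        assert all(0 <= i < BYTE_VOCAB_SIZE for i in ids)
        assert byte_ids_to_text(ids) == sentence
        print(f"{language}: {len(sentence)} characters, {len(ids)} byte IDs")
```

One consequence of this representation is that scripts outside the ASCII range yield longer input sequences, since a single character may occupy two to four bytes.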