Low-Resource Multilingual and Zero-Shot Multispeaker TTS

While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data those approaches need is generally not available for the vast majority of the world's more than 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language-agnostic meta-learning (LAML) procedure and modifications to a TTS encoder, we show that a system can learn to speak a new language from just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We demonstrate the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker, using objective metrics as well as human studies, and we release our code and trained models as open source.

Related content

Speech synthesis

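Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on several disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Speech synthesis is now widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, where it has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are gradually extending into entertainment, spoken-language teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.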
