Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that uses a large unlabeled speech dataset for pre-training. By leveraging wav2vec2.0 representations, unlabeled speech can substantially improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that, with pre-training on an unlabeled multi-speaker speech corpus, the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker.

Related content

Speech synthesis

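Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, speech-based teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.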