UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for the system development. Specifically, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts the FA to the unsupervised alignment (UA). Finally, a Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE), serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
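The front-end steps described in the abstract (lexicon lookup, then expansion of the phoneme sequence to a frame-level alignment) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the function names `text_to_phonemes` and `expand_to_frames`, and the hand-written integer durations are all assumptions for illustration. In the paper, durations come from a speaker-dependent duration model, which is not reproduced here.

```python
from typing import Dict, List

# Hypothetical toy lexicon (ARPAbet-style symbols); a real front end would
# load a full pronunciation dictionary instead.
TOY_LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Map whitespace-separated words to a flat phoneme sequence via the lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes


def expand_to_frames(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme for its duration (in frames) to get a frame-level alignment.

    The durations are assumed to be given; here they stand in for the output
    of a duration model.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


if __name__ == "__main__":
    phones = text_to_phonemes("hello world", TOY_LEXICON)
    # Hand-picked durations, purely illustrative.
    alignment = expand_to_frames(phones, [3, 5, 4, 6, 3, 5, 4, 2])
    print(len(alignment), alignment[:8])
```

The resulting frame-level sequence plays the role of the forced alignment (FA) in the abstract; the later stages (the FA-to-UA mapping module, the C-DSVAE, and the vocoder) are learned models and are outside the scope of this sketch.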


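Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain application scenarios. Speech synthesis is now widely used in information announcement systems at banks and hospitals, in car navigation systems, and in automated call centers, where it has delivered substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.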
