The zero-shot scenario for speech generation aims to synthesize a novel, unseen voice from only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages of the pipeline -- acoustic modeling and vocoding -- previous works usually address the problem at only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, which tackles both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model that extracts the latent distribution $p(z)$ of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same $p(z)$ from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further conduct high-quality zero-shot speech generation for new speakers. We investigate two methods to construct the speaker space, namely a pre-trained speaker encoder and a jointly trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments on the LibriTTS and VCTK corpora.
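To make the two-stage design concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a toy autoencoder stands in for the universal WaveGAN that extracts $p(z)$, and a stack of affine-coupling layers stands in for the flow-based acoustic model, conditioned on a speaker embedding. The text encoder is omitted, and all module names, dimensions, and architectures are illustrative assumptions.

```python
# Minimal sketch of the two-stage pipeline, NOT the authors' implementation.
# All names, dimensions, and architectures below are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 64   # assumed channel dimension of the latent z
SPK_DIM = 16      # assumed size of the speaker embedding


class WaveGANAutoencoder(nn.Module):
    """Stage 1 stand-in: encode a waveform to a latent sequence z and decode
    it back. In the paper this is a universal WaveGAN trained so that p(z) is
    a well-formed distribution; here both sides are toy convolutions."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv1d(1, LATENT_DIM, kernel_size=256, stride=128)
        self.decoder = nn.ConvTranspose1d(LATENT_DIM, 1, kernel_size=256, stride=128)

    def encode(self, wav):        # wav: (B, 1, T) -> z: (B, LATENT_DIM, T')
        return self.encoder(wav)

    def decode(self, z):          # z -> reconstructed waveform
        return self.decoder(z)


class AffineCoupling(nn.Module):
    """One invertible affine-coupling step, conditioned on speaker frames."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Conv1d(dim // 2 + cond_dim, dim, kernel_size=3, padding=1)

    def forward(self, x, cond):   # z -> noise direction
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        # swap halves on output so every channel gets transformed across steps
        return torch.cat([xb * torch.exp(log_s) + t, xa], dim=1)

    def inverse(self, y, cond):   # noise -> z direction (exact inverse)
        yb, xa = y.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        return torch.cat([xa, (yb - t) * torch.exp(-log_s)], dim=1)


class FlowAcousticModel(nn.Module):
    """Stage 2 stand-in: a flow that maps latents z to Gaussian noise given a
    speaker condition; at synthesis time the flow runs in reverse. The text
    conditioning that an acoustic model would also use is omitted for brevity."""

    def __init__(self, n_steps=4):
        super().__init__()
        self.steps = nn.ModuleList(
            AffineCoupling(LATENT_DIM, SPK_DIM) for _ in range(n_steps))

    def forward(self, z, cond):
        for step in self.steps:
            z = step(z, cond)
        return z

    def inverse(self, noise, cond):
        for step in reversed(self.steps):
            noise = step.inverse(noise, cond)
        return noise


if __name__ == "__main__":
    wavegan, flow = WaveGANAutoencoder(), FlowAcousticModel()
    wav = torch.randn(1, 1, 128 * 100 + 128)            # dummy waveform
    spk = torch.randn(1, SPK_DIM)                       # speaker embedding from a
                                                        # pre- or jointly-trained encoder
    z = wavegan.encode(wav)                             # stage-1 latent
    cond = spk.unsqueeze(-1).expand(-1, -1, z.size(-1)) # broadcast over frames
    eps = flow(z, cond)                                 # training direction: z -> noise
    z_hat = flow.inverse(eps, cond)                     # synthesis direction: noise -> z
    print(torch.allclose(z, z_hat, atol=1e-4),          # invertibility check
          wavegan.decode(z_hat).shape)                  # vocoder output shape
```

Because each coupling step is analytically invertible, sampling noise and running the flow in reverse with a new speaker's embedding yields latents in the same $p(z)$ the vocoder was trained on; this is the property the abstract relies on for zero-shot generation without fine-tuning either model.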