This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems. In this framework, we adopt a projection-based conditioning method that can significantly improve the discriminator's performance. Furthermore, the conventional discriminator is separated into two waveform discriminators for modeling voiced and unvoiced speech. As each discriminator learns the distinctive characteristics of the harmonic and noise components, respectively, the adversarial training process becomes more efficient, allowing the generator to produce more realistic speech waveforms. Subjective test results demonstrate the superiority of the proposed method over the conventional Parallel WaveGAN and WaveNet systems. In particular, our speaker-independently trained model within a FastSpeech 2-based text-to-speech framework achieves mean opinion scores of 4.20, 4.18, 4.21, and 4.31 for the four Japanese speakers, respectively.
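To make the two ideas summarized above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a projection-based conditional discriminator split into voiced and unvoiced branches, assuming a PyTorch environment. The class names, layer sizes, and the use of a sample-level voicing mask are hypothetical choices for illustration only.

```python
# Hypothetical sketch of a voicing-aware projection discriminator.
import torch
import torch.nn as nn


class ProjectionDiscriminator(nn.Module):
    """1-D convolutional discriminator with projection-based conditioning."""

    def __init__(self, cond_dim=80, channels=64):
        super().__init__()
        self.feature_net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=15, padding=7),
            nn.LeakyReLU(0.2),
        )
        self.unconditional_head = nn.Conv1d(channels, 1, kernel_size=1)
        # Projects auxiliary features (e.g., mel-spectrogram) into the
        # discriminator's feature space for the inner-product term.
        self.cond_proj = nn.Conv1d(cond_dim, channels, kernel_size=1)

    def forward(self, wav, cond):
        # wav:  (B, 1, T)        raw waveform segment
        # cond: (B, cond_dim, T) auxiliary features upsampled to the sample rate
        h = self.feature_net(wav)                       # (B, C, T)
        out = self.unconditional_head(h)                # (B, 1, T)
        # Projection term: inner product of conditioning and hidden features.
        out = out + (self.cond_proj(cond) * h).sum(dim=1, keepdim=True)
        return out


class VoicingAwareDiscriminator(nn.Module):
    """Two projection discriminators, one each for voiced and unvoiced regions."""

    def __init__(self, cond_dim=80):
        super().__init__()
        self.voiced = ProjectionDiscriminator(cond_dim)
        self.unvoiced = ProjectionDiscriminator(cond_dim)

    def forward(self, wav, cond, voicing_mask):
        # voicing_mask: (B, 1, T) with 1.0 on voiced samples, 0.0 elsewhere.
        d_v = self.voiced(wav, cond) * voicing_mask
        d_uv = self.unvoiced(wav, cond) * (1.0 - voicing_mask)
        return d_v, d_uv


if __name__ == "__main__":
    disc = VoicingAwareDiscriminator()
    wav = torch.randn(2, 1, 16000)
    cond = torch.randn(2, 80, 16000)
    mask = (torch.rand(2, 1, 16000) > 0.5).float()
    d_v, d_uv = disc(wav, cond, mask)
    print(d_v.shape, d_uv.shape)  # torch.Size([2, 1, 16000]) each
```

In this sketch, the voicing mask routes each waveform sample to the branch responsible for it, so the voiced discriminator focuses on harmonic structure while the unvoiced one focuses on noise-like segments; the exact masking and loss aggregation strategy of the paper may differ.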