Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss

This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the lightweight convolutional networks can be effectively trained without any distillation process. To further improve the vocoding performance, we propose the application of frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually sensitive errors in the frequency domain; thus, the model is optimized toward reducing auditory noise in the synthesized speech. Subjective listening test results demonstrate that our proposed method achieves 4.21 and 4.26 TTS mean opinion scores for female and male Korean speakers, respectively.
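As a rough illustration of how a frequency-dependent weight can enter an MR-STFT criterion, the following NumPy sketch combines a weighted spectral-convergence term with a weighted log-magnitude term at several STFT resolutions. It is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the function names, the placement of the weight on the magnitude difference, the three (FFT size, hop, window) settings, and in particular the linear ramp in `example_weight` are all hypothetical placeholders chosen for illustration. The abstract does not specify how the perceptual weight is derived.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude spectrogram of a 1-D signal, shape (frames, fft_size // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    # rfft zero-pads each windowed frame up to fft_size.
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))


def example_weight(n_bins):
    """Hypothetical weight per frequency bin: a linear ramp from 1.0 down to 0.5.

    This is only a stand-in; it is NOT the perceptual weight used in the paper.
    """
    return np.linspace(1.0, 0.5, n_bins)


def weighted_stft_loss(x, x_hat, weight, fft_size, hop, win_length, eps=1e-7):
    """Weighted spectral convergence plus weighted log-magnitude L1 distance."""
    mag = stft_magnitude(x, fft_size, hop, win_length)
    mag_hat = stft_magnitude(x_hat, fft_size, hop, win_length)
    w = weight[None, :]  # broadcast the per-bin weight over all frames
    # Assumption: the weight scales the error in each frequency bin.
    sc = np.linalg.norm(w * (mag - mag_hat)) / (np.linalg.norm(w * mag) + eps)
    log_mag = np.mean(np.abs(w * (np.log(mag + eps) - np.log(mag_hat + eps))))
    return sc + log_mag


def mr_stft_loss(
    x,
    x_hat,
    weight_fn=example_weight,
    # Assumed (fft_size, hop, win_length) triples, for illustration only.
    resolutions=((1024, 120, 600), (2048, 240, 1200), (512, 50, 240)),
):
    """Average the weighted STFT loss over several analysis resolutions."""
    losses = [
        weighted_stft_loss(x, x_hat, weight_fn(fft_size // 2 + 1), fft_size, hop, win)
        for fft_size, hop, win in resolutions
    ]
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(4800)
    generated = target + 0.1 * rng.standard_normal(4800)
    print("weighted MR-STFT loss:", mr_stft_loss(target, generated))
```

In a training setup, a term of this kind would be computed on the generator output and combined with the adversarial loss; a bin with a larger weight contributes more to the penalty, which is how errors in perceptually sensitive regions can be emphasized.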
