Entertainment-oriented singing voice synthesis (SVS) requires a vocoder that generates high-fidelity audio (e.g., 48 kHz). However, despite the significant progress of neural vocoders for text-to-speech (TTS), most of them do not perform well in this scenario. In this paper, we propose HiFi-WaveGAN, which is designed to synthesize high-quality 48 kHz singing voices from the full-band mel-spectrogram in real time. Specifically, it consists of a generator improved from WaveNet, a multi-period discriminator identical to that of HiFiGAN, and a multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we design a novel auxiliary spectrogram-phase loss to train the neural network, which also accelerates training. The experimental results show that HiFi-WaveGAN significantly outperforms other neural vocoders, such as Parallel WaveGAN (PWG) and HiFiGAN, in mean opinion score (MOS) on the 48 kHz SVS task. A comparative study of HiFi-WaveGAN with and without the phase loss term shows that the phase loss indeed speeds up training. We also compare the spectrograms generated by HiFi-WaveGAN and PWG, which shows that HiFi-WaveGAN models the high-frequency parts better.
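The abstract does not state the form of the auxiliary spectrogram-phase loss, so the following is only an illustrative sketch of what a loss of this kind might look like, not the formulation used for HiFi-WaveGAN. It assumes a log-magnitude L1 term plus a wrap-safe phase term, 1 - cos(Δφ), computed on a naive short-time Fourier transform. The function names (`stft`, `spectrogram_phase_loss`), the frame and hop sizes, and the `phase_weight` parameter are all hypothetical, and the pure-Python DFT is chosen for self-containment rather than speed.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=16):
    """Naive STFT: Hann-windowed frames, direct DFT, non-negative frequency bins only."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def spectrogram_phase_loss(real, fake, frame_len=64, hop=16, phase_weight=1.0):
    """Hypothetical loss (an assumption, not the paper's definition):
    mean |log-magnitude difference| + phase_weight * mean (1 - cos(phase difference)).
    Both signals must contain at least frame_len samples."""
    eps = 1e-7  # keeps log() finite for near-silent bins
    mag_term = 0.0
    phase_term = 0.0
    count = 0
    for real_bins, fake_bins in zip(stft(real, frame_len, hop), stft(fake, frame_len, hop)):
        for r, f in zip(real_bins, fake_bins):
            mag_term += abs(math.log(abs(r) + eps) - math.log(abs(f) + eps))
            # 1 - cos(delta) lies in [0, 2] and is insensitive to 2*pi wrapping
            phase_term += 1.0 - math.cos(cmath.phase(r) - cmath.phase(f))
            count += 1
    return (mag_term + phase_weight * phase_term) / count


# Toy signals: a sine and the same sine shifted by a quarter period.
reference = [math.sin(2 * math.pi * 5 * n / 64) for n in range(256)]
shifted = [math.sin(2 * math.pi * 5 * n / 64 + math.pi / 2) for n in range(256)]

loss_same = spectrogram_phase_loss(reference, reference)
loss_shifted = spectrogram_phase_loss(reference, shifted)
loss_shifted_mag_only = spectrogram_phase_loss(reference, shifted, phase_weight=0.0)
```

In this toy setup the two sines have nearly identical magnitude spectrograms, so a magnitude-only loss barely separates them, while the phase term penalizes the shift. A practical implementation would more likely use a tensor library's batched STFT and might weight the phase term by magnitude so that near-silent bins, whose phase is essentially noise, contribute less.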