Several recent studies on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this study, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method achieves quality comparable to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU. We further show the generality of HiFi-GAN by applying it to mel-spectrogram inversion for unseen speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real time on CPU with quality comparable to an autoregressive counterpart.
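To make the periodicity claim concrete: HiFi-GAN's discriminator examines a waveform at several fixed periods by folding the 1D signal into a 2D grid in which samples spaced one period apart line up in the same column. The sketch below is a minimal, hypothetical NumPy illustration of that reshaping step (the `reshape_by_period` helper and its padding choice are our assumptions, not the authors' code, which applies 2D convolutions to such grids).

```python
import numpy as np

def reshape_by_period(wav: np.ndarray, period: int) -> np.ndarray:
    """Fold a 1D waveform of length T into a (ceil(T / period), period)
    grid. Column j then holds every period-th sample starting at offset j,
    so a signal with that period varies little down each column.
    Hypothetical helper illustrating the multi-period idea; not the
    authors' implementation."""
    remainder = len(wav) % period
    if remainder != 0:
        # Pad the tail so the length divides evenly into the period.
        wav = np.pad(wav, (0, period - remainder), mode="reflect")
    return wav.reshape(-1, period)

# A 1 s, 22.05 kHz waveform folded at period 220 (close to the ~220.5-sample
# period of a 100 Hz sinusoid at that rate) aligns periodic structure
# column-wise for a 2D discriminator to inspect.
sr = 22050
t = np.arange(sr) / sr
wav = np.sin(2 * np.pi * 100.0 * t)
grid = reshape_by_period(wav, 220)  # shape: (101, 220) after padding
```

In the paper's multi-period discriminator, several such grids (one per prime period) are each scored by a separate 2D-convolutional sub-discriminator, which is what lets the model penalize errors in periodic structure that a purely sequential view can miss.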