Although recent work on neural vocoders has improved the quality of synthesized audio, a gap remains between generated and ground-truth audio in frequency space. This difference leads to spectral artifacts such as hissing noise or robotic sound, and thus degrades sample quality. In this paper, we propose Fre-GAN, which achieves frequency-consistent audio synthesis with highly improved generation quality. Specifically, we first present a resolution-connected generator and resolution-wise discriminators, which help the model learn various scales of spectral distributions over multiple frequency bands. Additionally, to reproduce high-frequency components accurately, we leverage the discrete wavelet transform in the discriminators. In our experiments, Fre-GAN achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared to ground-truth audio, while outperforming standard models in quality.
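To illustrate why a discrete wavelet transform can help a discriminator see high-frequency content, the following is a minimal sketch rather than the Fre-GAN implementation. It assumes a one-level Haar wavelet and compares it with plain average pooling as a downsampling step; the abstract does not specify the wavelet or how the sub-bands are fed to the discriminators, so the function names `haar_dwt`, `haar_idwt`, and `average_pool` are purely illustrative.

```python
import math


def haar_dwt(x):
    """One-level Haar DWT (assumed wavelet) of an even-length sequence.

    Returns (approx, detail): the low- and high-frequency sub-bands,
    each half the length of the input.
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt: rebuild the signal from both sub-bands."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) * s)
        x.append((a - d) * s)
    return x


def average_pool(x):
    """Downsample by 2 with average pooling; keeps only the low band."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


if __name__ == "__main__":
    # A signal oscillating at the highest representable frequency.
    signal = [1.0, -1.0, 1.0, -1.0]
    approx, detail = haar_dwt(signal)
    print("average pooling:", average_pool(signal))
    print("DWT approx     :", approx)
    print("DWT detail     :", detail)
    print("reconstructed  :", haar_idwt(approx, detail))
```

In this toy case, average pooling maps the oscillating signal to all zeros, so a discriminator operating on the pooled signal could not tell it apart from silence. The Haar transform also halves the temporal resolution, but the detail sub-band retains the oscillation and the two sub-bands together reconstruct the input up to floating-point rounding.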