Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a source-filter-theory-based architecture into a parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, its high-temporal-resolution inputs result in high computational costs. Although the HiFi-GAN vocoder achieves fast, high-fidelity voice generation thanks to its efficient upsampling-based generator architecture, its pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on well-estimated source excitation information. Experimental results show that the proposed method outperforms HiFi-GAN and uSFGAN in singing voice generation, in terms of both voice quality and synthesis speed on a single CPU. Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily adopted in real-time applications and integrated into end-to-end systems.
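To illustrate the hierarchical conditioning idea, the following is a minimal PyTorch sketch, not the authors' implementation: a source network estimates an excitation signal from upsampled F0, and a HiFi-GAN-style upsampling filter network injects that excitation at each intermediate temporal resolution. All module names, channel sizes, and upsampling rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SourceNetwork(nn.Module):
    """Estimates a sample-rate excitation signal from upsampled F0 (illustrative)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, f0_upsampled: torch.Tensor) -> torch.Tensor:
        # f0_upsampled: (batch, 1, num_samples) -> excitation: (batch, 1, num_samples)
        return self.net(f0_upsampled)


class FilterNetwork(nn.Module):
    """HiFi-GAN-style upsampling generator, hierarchically conditioned on the
    excitation at every intermediate temporal resolution (sketch)."""

    def __init__(self, mel_dim: int = 80, channels: int = 256,
                 upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.upsample_rates = upsample_rates
        self.pre = nn.Conv1d(mel_dim, channels, kernel_size=7, padding=3)
        self.ups, self.cond, self.res = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        ch = channels
        for r in upsample_rates:
            out_ch = ch // 2
            self.ups.append(nn.ConvTranspose1d(ch, out_ch, kernel_size=2 * r,
                                               stride=r, padding=r // 2))
            self.cond.append(nn.Conv1d(1, out_ch, kernel_size=1))  # excitation injection
            self.res.append(nn.Conv1d(out_ch, out_ch, kernel_size=7, padding=3))
            ch = out_ch
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel: torch.Tensor, excitation: torch.Tensor) -> torch.Tensor:
        x = self.pre(mel)
        remaining = 1
        for r in self.upsample_rates:
            remaining *= r
        for up, cond, res, r in zip(self.ups, self.cond, self.res, self.upsample_rates):
            x = up(F.leaky_relu(x, 0.1))
            remaining //= r
            # Downsample the sample-rate excitation to the current resolution.
            e = (F.avg_pool1d(excitation, kernel_size=remaining, stride=remaining)
                 if remaining > 1 else excitation)
            n = min(x.size(-1), e.size(-1))
            x = res(x[..., :n] + cond(e[..., :n]))
        return torch.tanh(self.post(x))


# Illustrative usage: 100 mel frames with hop size 256 -> 25,600 output samples.
mel = torch.randn(1, 80, 100)
f0 = torch.rand(1, 1, 100 * 256) * 300.0          # F0 upsampled to sample rate
excitation = SourceNetwork()(f0)
waveform = FilterNetwork()(mel, excitation)        # (1, 1, 25600)
```

In this sketch, pitch controllability comes from the excitation path being driven directly by F0, while the efficient upsampling path keeps the fast HiFi-GAN-style synthesis; the actual networks described in the paper are more elaborate.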