Neural vocoders (NVs) now generate high-quality waveforms quickly. However, conventional NVs target a single sampling rate and must be retrained to operate at a different one. The suitable sampling rate varies from application to application because of the trade-off between speech quality and generation speed. In this study, we propose MSR-NV, a method for handling multiple sampling rates in a single NV. By generating waveforms step by step, starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. The approach can be regarded as an extension of previously proposed NVs; in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing inference time. We also show that MSR-NV can leverage speech data with lower sampling rates to further improve the quality of the synthetic speech.
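
To make the step-by-step idea concrete, the following minimal sketch works out the arithmetic implied by the three sampling rates named in the abstract: the upsampling ratio between consecutive stages and the frequency band that each stage newly covers, up to its Nyquist frequency. It is not the MSR-NV architecture. The function name `stage_plan`, the dictionary keys, and the assumption that stages are ordered from lowest to highest rate are all illustrative choices, not details taken from the paper.

```python
from fractions import Fraction


def stage_plan(rates_hz):
    """Describe a low-to-high cascade of sampling rates (hypothetical helper).

    For each stage, report the output rate, the upsampling ratio relative
    to the previous stage, and the band (in Hz) that becomes representable
    only at this stage, bounded above by the Nyquist frequency rate / 2.
    Rates are assumed to be even integers in Hz.
    """
    plan = []
    prev_rate = None
    for rate in sorted(rates_hz):
        nyquist = rate // 2
        # The first stage covers everything from 0 Hz up to its Nyquist limit.
        low_edge = 0 if prev_rate is None else prev_rate // 2
        ratio = None if prev_rate is None else Fraction(rate, prev_rate)
        plan.append(
            {"rate_hz": rate, "ratio": ratio, "new_band_hz": (low_edge, nyquist)}
        )
        prev_rate = rate
    return plan


# Sampling rates taken from the abstract: 16, 24, and 48 kHz.
PLAN = stage_plan([16000, 24000, 48000])
for stage in PLAN:
    print(stage)
```

Under these assumptions, the 24 kHz stage is a 3/2 upsampling of the 16 kHz stage and adds the 8 to 12 kHz band, while the 48 kHz stage doubles the rate again and adds the 12 to 24 kHz band. This is one way to read the claim that each step can concentrate on its own frequency band.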