This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. The original HiFi-GAN is a high-fidelity, computationally efficient, small-footprint neural vocoder. We incorporate a speaking rate control function into HiFi-GAN to improve the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. Within this layer, we implement a signal resampling method and an image scaling method, each of which warps either the mel-spectrograms or the hidden features of the neural vocoder. We also design and open-source a Japanese speech corpus containing three speaking rates to evaluate the proposed speaking rate control method. Comprehensive objective and subjective evaluations demonstrate that 1) the proposed method outperforms a baseline time-scale modification algorithm in speech naturalness, 2) warping mel-spectrograms by image scaling achieves the best performance among all proposed methods, and 3) the proposed speaking rate control method can be incorporated into HiFi-GAN without losing computational efficiency.
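As a rough illustration of what warping a mel-spectrogram along the time axis means, the following minimal sketch stretches or compresses a sequence of frames by linear interpolation. It is not the paper's interpolation layer: the function name `warp_time_axis`, the `rate` convention (values above 1 shorten the sequence), and the choice of plain linear interpolation with aligned endpoints are all assumptions made for this example. The actual method operates inside the HiFi-GAN network and is described only at a high level here.

```python
def warp_time_axis(mel, rate):
    """Linearly interpolate a mel-spectrogram along its time axis.

    mel:  list of frames, each a list of mel-bin values (T x n_mels).
    rate: hypothetical speaking-rate factor; rate > 1 yields fewer
          frames (faster speech), rate < 1 yields more (slower speech).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = len(mel)
    n_out = max(1, round(n_in / rate))
    if n_in == 1 or n_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(mel[0]) for _ in range(n_out)]
    # Map each output frame index onto the input time axis,
    # keeping the first and last frames aligned.
    step = (n_in - 1) / (n_out - 1)
    warped = []
    for t in range(n_out):
        pos = t * step
        lo = min(int(pos), n_in - 2)  # left neighbour frame
        frac = pos - lo               # weight of the right neighbour
        warped.append(
            [(1 - frac) * a + frac * b for a, b in zip(mel[lo], mel[lo + 1])]
        )
    return warped
```

For instance, calling `warp_time_axis(mel, 0.5)` on a 3-frame input returns 6 frames, while `warp_time_axis(mel, 2.0)` on a 4-frame input returns 2. Because each output value is a weighted sum of two input values, the operation is differentiable with respect to the input, which is the property an interpolation layer inside a trained network would need.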