This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system, working towards end-to-end controllable speech synthesis. Because the mel-cepstral synthesis filter is explicitly embedded in the neural waveform models of the proposed system, the voice characteristics and the pitch of synthesized speech are highly controllable through a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module so that the acoustic and waveform models in the proposed system can be optimized simultaneously in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
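To make the idea of a differentiable mel-cepstral synthesis filter concrete, the following is a minimal PyTorch sketch, not the implementation used in the paper. It assumes the standard first-order all-pass frequency warping with parameter `alpha`, builds a minimum-phase frequency response from mel-cepstral coefficients, and filters a single excitation frame by multiplication in the FFT domain (a circular-convolution simplification chosen here for brevity). The function names `warped_frequencies`, `mcep_to_response`, and `filter_frame` are hypothetical.

```python
import math
import torch


def warped_frequencies(alpha, n_bins, dtype=torch.float32):
    """Phase of the first-order all-pass function, i.e. the warped frequency axis.

    Assumes |alpha| < 1. With alpha = 0 the axis is left unwarped.
    """
    omega = torch.linspace(0.0, math.pi, n_bins, dtype=dtype)
    return omega + 2.0 * torch.atan2(
        alpha * torch.sin(omega), 1.0 - alpha * torch.cos(omega)
    )


def mcep_to_response(mcep, alpha, n_fft):
    """Complex frequency response H = exp(sum_m c(m) * exp(-j * m * w_warped)).

    mcep: tensor of shape (..., M + 1) holding mel-cepstral coefficients.
    Returns a complex tensor of shape (..., n_fft // 2 + 1).
    """
    n_bins = n_fft // 2 + 1
    w = warped_frequencies(alpha, n_bins, dtype=mcep.dtype)    # (n_bins,)
    m = torch.arange(mcep.shape[-1], dtype=mcep.dtype)         # (M + 1,)
    phase = w[:, None] * m[None, :]                            # (n_bins, M + 1)
    log_real = mcep @ torch.cos(phase).T                       # log-magnitude
    log_imag = -(mcep @ torch.sin(phase).T)                    # phase term
    return torch.exp(torch.complex(log_real, log_imag))


def filter_frame(excitation, mcep, alpha, n_fft):
    """Filter one excitation frame in the frequency domain (illustrative only).

    excitation: tensor of shape (..., n_fft). Every step is a differentiable
    torch operation, so gradients reach both mcep and excitation.
    """
    response = mcep_to_response(mcep, alpha, n_fft)
    spectrum = torch.fft.rfft(excitation, n=n_fft)
    return torch.fft.irfft(spectrum * response, n=n_fft)
```

In this sketch, changing `alpha` reshapes the spectral envelope, while pitch would be set by the fundamental frequency of the excitation passed in. With `alpha = 0` and only the zeroth coefficient non-zero, the filter reduces to a constant gain of `exp(c0)`, which gives a simple sanity check.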