This paper proposes a method for generating speech from filterbank mel-frequency cepstral coefficients (MFCCs), which are widely used in speech applications such as automatic speech recognition (ASR) but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural network. Second, the spectral envelope information contained in MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial network (GAN) based noise model that adds a realistic high-frequency stochastic component to the modeled excitation signal. The results show that high-quality speech reconstruction can be obtained given only MFCC information at test time.
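The envelope-recovery step described above (converting the spectral envelope encoded in MFCCs into all-pole filters) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the function names (`mfcc_to_allpole`, `levinson_durbin`) and parameters (`n_mel`, `lpc_order`) are assumptions, and a real system would also warp the mel-spaced envelope back to a linear frequency axis before fitting the filter.

```python
import numpy as np
from scipy.fftpack import idct


def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation -> all-pole coefficients.

    Returns the filter polynomial a = [1, a_1, ..., a_order] and the
    final prediction error.
    """
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual.
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # symmetric coefficient update
        err *= (1.0 - k * k)
    return a, err


def mfcc_to_allpole(mfcc, n_mel=40, lpc_order=20):
    """Hypothetical sketch of one analysis frame:
    1) zero-pad the truncated cepstrum and invert the DCT -> log mel envelope,
    2) exponentiate -> power-domain envelope,
    3) inverse FFT of the (one-sided) power spectrum -> autocorrelation,
    4) Levinson-Durbin -> all-pole filter coefficients.
    """
    log_mel = idct(np.pad(mfcc, (0, n_mel - len(mfcc))), norm='ortho')
    power = np.exp(log_mel)                   # power-domain spectral envelope
    r = np.fft.irfft(power)[:lpc_order + 1]   # autocorrelation sequence
    return levinson_durbin(r, lpc_order)
```

For a flat (all-zero) cepstrum the recovered envelope is flat, so the fitted filter degenerates to `a = [1, 0, ..., 0]`, which is a quick sanity check on the recursion.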