Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is slow at waveform generation because of its autoregressive (AR) structure. Faster non-AR models have recently been reported, but they can be prohibitively complicated because they rely on a distillation-based training method and a blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and stochastic gradient descent. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet, and the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
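To make the source module concrete, the following is a minimal sketch of how a sine-based excitation signal can be constructed from a frame-level F0 contour: voiced samples carry a sine wave whose instantaneous phase tracks F0, unvoiced samples carry noise, and a small noise component is added throughout. The function name, hop size, amplitude, and noise level are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sine_excitation(f0, sr=16000, hop=80, amp=0.1, noise_std=0.003):
    """Sketch of a sine-based excitation generator (hypothetical parameters).

    f0: per-frame F0 values in Hz; 0 marks unvoiced frames.
    Voiced samples get a sine wave tracking F0; unvoiced samples get noise.
    """
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    # Upsample frame-level F0 to sample level by simple repetition.
    f0_up = np.repeat(np.asarray(f0, dtype=float), hop)
    # Instantaneous phase is the cumulative sum of per-sample frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    voiced = f0_up > 0
    excitation = np.where(voiced,
                          amp * np.sin(phase),
                          noise_std * rng.standard_normal(len(f0_up)))
    # Small additive noise everywhere, as in sine-plus-noise excitation designs.
    excitation += noise_std * rng.standard_normal(len(f0_up))
    return excitation

f0 = [0, 0, 220, 220, 220, 0]   # toy F0 contour in Hz (unvoiced-voiced-unvoiced)
e = sine_excitation(f0)
print(e.shape)                  # (480,) = 6 frames x 80 samples per frame
```

In the full model, an excitation like this would be fed to the neural filter module, which transforms it into the speech waveform; here only the source side is sketched.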