Although sequence-to-sequence networks with attention mechanisms and neural vocoders have greatly improved the quality of speech synthesis, some problems remain in large-scale real-time applications: for example, avoiding alignment failures on long sentences while maintaining rich prosody, and reducing computational overhead while preserving perceptual quality. To address these issues, we propose a practical neural text-to-speech system, named Triple M, consisting of a seq2seq model with multi-guidance attention and a multi-band multi-time LPCNet. The former uses the alignment results of different attention mechanisms to guide the learning of the basic attention mechanism, and retains only the basic attention mechanism during inference. This approach improves the performance of the text-to-feature module by absorbing the advantages of all the guidance attention methods without modifying the basic inference architecture. The latter reduces the computational complexity of LPCNet by combining multi-band and multi-time strategies. The multi-band strategy enables LPCNet to generate sub-band signals in each inference step. By predicting the sub-band signals of adjacent time steps in one forward operation, the multi-time strategy further decreases the number of inferences required. With the multi-band and multi-time strategies, the vocoder runs 2.75x faster on a single CPU, with only slight degradation in MOS (mean opinion score).
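To make the effect of the multi-band and multi-time strategies on the number of inferences concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes that one forward pass emits one sample for each of `num_bands` sub-bands at each of `num_times` adjacent time steps, and that each sub-band is downsampled by `num_bands`, so a pass covers `num_bands * num_times` full-rate samples. The function name, the 16 kHz sample rate, and the values 4 and 2 are hypothetical choices for illustration only.

```python
import math


def forward_passes(num_samples: int, num_bands: int = 1, num_times: int = 1) -> int:
    """Count autoregressive forward passes needed to cover num_samples samples.

    Assumption (not taken from the paper): each pass yields one sample per
    sub-band for num_times adjacent time steps, i.e. num_bands * num_times
    full-rate samples per pass.
    """
    if num_samples < 0 or num_bands < 1 or num_times < 1:
        raise ValueError("num_samples must be >= 0; num_bands and num_times must be >= 1")
    samples_per_pass = num_bands * num_times
    return math.ceil(num_samples / samples_per_pass)


# Hypothetical setting: one second of audio at an assumed 16 kHz sample rate.
one_second = 16000
baseline = forward_passes(one_second)                          # sample-by-sample
multi_band = forward_passes(one_second, num_bands=4)           # assumed 4 sub-bands
triple = forward_passes(one_second, num_bands=4, num_times=2)  # plus 2 adjacent time steps

print(baseline, multi_band, triple)
```

The pass count is only a proxy for wall-clock speed: each pass predicts more outputs, and other stages of the vocoder still cost time, so the reduction in inferences is expected to be larger than the measured speedup (the abstract reports 2.75x on a single CPU).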