Vocoders have received renewed attention as core components of statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Although some vocoding techniques produce synthesized speech of nearly acceptable quality, their high computational complexity and irregular structures remain challenging and lead to various kinds of voice quality degradation. This paper therefore presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system. First, a new continuous noise masking method based on phase distortion is proposed to eliminate the perceptual impact of residual noise and to allow accurate reconstruction of noise characteristics. Second, we address the need for a neural sequence-to-sequence modeling approach to TTS based on recurrent networks: bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters, with the aim of more natural, human-like speech. The evaluation results show that the proposed model achieves state-of-the-art speech synthesis performance compared with traditional methods.
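As an illustration of the recurrent modeling idea mentioned above, the following is a minimal NumPy sketch of a bidirectional GRU that maps a sequence of per-frame input features to continuous output parameters. It is not the model described in this paper: the function names (`init_gru`, `gru_run`, `bigru_predict`), the layer sizes, the random weights, and the choice of three output parameters per frame are all assumptions made for the example, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random weights for one GRU direction (hypothetical initialisation)."""
    scale = 0.1
    return {
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        "W": rng.normal(0.0, scale, (3, hidden_dim, input_dim)),
        "U": rng.normal(0.0, scale, (3, hidden_dim, hidden_dim)),
        "b": np.zeros((3, hidden_dim)),
    }

def gru_run(params, xs):
    """Run a GRU over xs of shape (T, input_dim); return states (T, hidden_dim)."""
    W, U, b = params["W"], params["U"], params["b"]
    h = np.zeros(U.shape[1])
    states = []
    for x in xs:
        z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
        r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
        h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
        h = (1.0 - z) * h + z * h_tilde                      # blend old and new state
        states.append(h)
    return np.stack(states)

def bigru_predict(fwd, bwd, W_out, b_out, xs):
    """Concatenate forward and backward GRU states, then apply a linear output layer."""
    h_f = gru_run(fwd, xs)
    h_b = gru_run(bwd, xs[::-1])[::-1]  # run on reversed input, re-align to time order
    h = np.concatenate([h_f, h_b], axis=1)
    return h @ W_out.T + b_out

# Toy usage with assumed sizes: 5 frames, 8 input features, 3 continuous outputs.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, out_dim = 5, 8, 16, 3
fwd = init_gru(input_dim, hidden_dim, rng)
bwd = init_gru(input_dim, hidden_dim, rng)
W_out = rng.normal(0.0, 0.1, (out_dim, 2 * hidden_dim))
b_out = np.zeros(out_dim)
xs = rng.normal(size=(T, input_dim))
y = bigru_predict(fwd, bwd, W_out, b_out, xs)  # one row of continuous parameters per frame
```

Because the backward pass reads the sequence from the end, the prediction for each frame can depend on later frames as well as earlier ones, which is the usual motivation for a bidirectional recurrent layer in frame-level parameter prediction.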