End-to-end text-to-speech (TTS) synthesis directly converts input text into output acoustic features using a single network. Recent advances in end-to-end TTS are largely due to attention mechanisms, and all successful methods proposed so far have been based on soft attention. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention may still fail to learn and predict an accurate alignment between the input and output, possibly because soft attention is too flexible. We therefore propose an approach with more explicit yet natural constraints suited to speech signals, which makes alignment learning and prediction in end-to-end TTS systems more robust. The proposed system, whose constrained alignment scheme is borrowed from segment-to-segment neural transduction (SSNT), directly calculates the joint probability of acoustic features and alignment given an input text. Reflecting the nature of speech, the alignment is designed to be hard and monotonically increasing; it is treated as a latent variable and marginalized during training. During prediction, both the alignment and the acoustic features can be generated from the probability distributions. The advantages of our approach are that many modules needed for soft attention can be simplified and that the end-to-end TTS model can be trained with a single likelihood function. To the best of our knowledge, ours is the first end-to-end TTS approach without a soft attention mechanism.
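To make the marginalization over a hard, monotonic alignment concrete, the following is a minimal sketch of a forward-algorithm computation in the log domain. It is not the authors' implementation: the function name, the array names `log_obs` and `log_stay`, and the specific assumptions (the alignment starts at the first input position, ends at the last one, and at each frame either stays or advances by exactly one position) are illustrative choices made here.

```python
import numpy as np


def monotonic_marginal_log_likelihood(log_obs, log_stay):
    """Sum over all hard monotonic alignments with a forward recursion.

    Hypothetical inputs (names are assumptions, not from the paper):
      log_obs[t, j]  : log-probability of output frame t given that the
                       alignment points at input position j.
      log_stay[t, j] : log-probability of staying at position j when moving
                       from frame t to frame t + 1; the probability of
                       shifting to j + 1 is taken to be 1 - exp(log_stay).
                       The last row (t = T - 1) is not used.
    Returns the log of the total probability of all alignments that start
    at j = 0 and end at j = J - 1.
    """
    T, J = log_obs.shape
    with np.errstate(divide="ignore"):
        log_shift = np.log1p(-np.exp(log_stay))

    # alpha[t, j]: log-probability of frames 0..t with the alignment at j.
    alpha = np.full((T, J), -np.inf)
    alpha[0, 0] = log_obs[0, 0]

    for t in range(1, T):
        # Case 1: the alignment stayed at the same input position.
        stay = alpha[t - 1] + log_stay[t - 1]
        # Case 2: the alignment advanced from position j - 1 to j.
        shift = np.full(J, -np.inf)
        shift[1:] = alpha[t - 1, :-1] + log_shift[t - 1, :-1]
        alpha[t] = np.logaddexp(stay, shift) + log_obs[t]

    return alpha[T - 1, J - 1]


if __name__ == "__main__":
    # Toy case: 3 output frames, 2 input positions, stay probability 0.5,
    # and every frame given probability 1 under every position.
    log_obs = np.zeros((3, 2))
    log_stay = np.full((3, 2), np.log(0.5))
    print(np.exp(monotonic_marginal_log_likelihood(log_obs, log_stay)))
```

In the toy case there are exactly two valid alignments (shift after the first frame, or shift after the second), each with probability 0.25, so the marginal is 0.5. Under these assumptions, the negative of the returned value could serve as a single training loss, and sampling or greedily choosing stay or shift at each frame would correspond to generating the alignment at prediction time.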