Neural sequence-to-sequence text-to-speech (TTS) synthesis systems, such as Tacotron-2, transform text into high-quality speech. However, generating speech with natural prosody remains a challenge. Yasuda et al. show that Tacotron-2's encoder does not fully represent prosodic features (e.g., syllable stress in English) from characters, which results in flatter fundamental frequency variation than in natural speech. In this work, we propose a novel, carefully designed strategy for conditioning Tacotron-2 on two fundamental prosodic features of English, syllable stress and pitch accent, that help achieve more natural prosody. To this end, we use a classifier to learn these features in an end-to-end fashion, and apply feature conditioning at three parts of Tacotron-2's text-to-mel-spectrogram pipeline: pre-encoder, post-encoder, and intra-decoder. Further, we show that jointly conditioning on these features at the pre-encoder and intra-decoder stages results in more prosodically natural synthesized speech than Tacotron-2, and allows the model to produce speech with more accurate pitch accent and stress patterns. Quantitative evaluations show that our formulation achieves higher fundamental frequency contour correlation and lower Mel Cepstral Distortion between synthesized and natural speech. Subjective evaluation shows that the proposed method's Mean Opinion Score of 4.14 is higher than the baseline Tacotron-2's 3.91, compared with 4.28 for natural speech (LJSpeech corpus).
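
The two quantitative measures named above can be illustrated with a minimal sketch. This is not the authors' evaluation code: it assumes the synthesized and natural utterances have already been time-aligned frame by frame (for example with dynamic time warping), that the 0th cepstral coefficient (energy) is excluded from Mel Cepstral Distortion as is common practice, and that unvoiced frames are marked with an F0 of 0. The function names `mel_cepstral_distortion` and `f0_correlation` are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel Cepstral Distortion (dB) over aligned frames.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Assumption: coefficient 0 (energy) is skipped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)


def f0_correlation(ref_f0, syn_f0):
    """Pearson correlation of two aligned F0 contours.

    Assumption: only frames voiced in both contours (F0 > 0) are compared.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("contours must have the same length")
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    if var_r == 0 or var_s == 0:
        raise ValueError("a constant contour has no defined correlation")
    return cov / math.sqrt(var_r * var_s)
```

Under these assumptions, a lower `mel_cepstral_distortion` value and an `f0_correlation` value closer to 1 both indicate synthesized speech that is closer to the natural recording.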