Recent neural end-to-end TTS models produce high-quality, natural-sounding speech in conventional sentence-based synthesis. Reproducing that quality for a whole paragraph remains challenging, because a paragraph-based TTS model must account for a large amount of contextual information. To ease training, we propose to model linguistic and prosodic information through the cross-sentence, embedded structure of a paragraph. Three sub-modules, a linguistics-aware network, a prosody-aware network, and a sentence-position network, are trained jointly with a modified Tacotron2. The linguistics-aware and prosody-aware networks learn the information embedded in a paragraph and the relations among its component sentences: encoders capture the paragraph-level information, and multi-head attention mechanisms learn the inter-sentence information. The sentence-position network explicitly exploits the relative position of each sentence in the paragraph. Trained on a 4.08-hour storytelling audiobook corpus recorded by a female Mandarin Chinese speaker, the proposed model produces fairly natural, good-quality speech at the paragraph level. It predicts and renders cross-sentence contextual information, such as breaks and prosodic variations between consecutive sentences, better than the sentence-based model. On test paragraphs whose lengths are similar to, longer than, or much longer than the typical paragraph length in the training data, speech from the new model is consistently preferred over that of the sentence-based model in subjective tests, a preference confirmed by objective measures.
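
The two cross-sentence ideas named in the abstract, multi-head attention over the sentences of a paragraph and an explicit relative sentence position, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy 4-dimensional sentence vectors, the position scaling to [0, 1], and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math

def relative_positions(num_sentences):
    """Relative position of each sentence in its paragraph, scaled to [0, 1].
    The scaling is an assumption; the abstract does not specify the encoding."""
    if num_sentences == 1:
        return [0.0]
    return [i / (num_sentences - 1) for i in range(num_sentences)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each sentence attends to all sentences."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def multi_head_context(sentence_vecs, num_heads):
    """Split each sentence vector into num_heads slices, attend within each
    slice, and concatenate the results. A real model would apply learned
    query/key/value projections per head; they are omitted here."""
    dim = len(sentence_vecs[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    context = [[] for _ in sentence_vecs]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        part = [v[lo:hi] for v in sentence_vecs]
        for row, attended in zip(context, attend(part, part, part)):
            row.extend(attended)
    return context

# Hypothetical sentence-level embeddings for a three-sentence paragraph.
paragraph = [[1.0, 0.0, 0.5, 0.2],
             [0.0, 1.0, 0.1, 0.9],
             [0.7, 0.3, 0.4, 0.4]]
context = multi_head_context(paragraph, num_heads=2)
positions = relative_positions(len(paragraph))
# Append each sentence's relative position to its cross-sentence context vector.
features = [c + [p] for c, p in zip(context, positions)]
```

In this sketch each entry of `features` is the kind of per-sentence conditioning vector that could be passed to a sentence-level synthesizer such as a Tacotron2 variant; how the proposed model actually combines these signals is not stated in the abstract.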