Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one such successful implementation. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at the sentence level. We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under the graph neural network framework. GraphSpeech explicitly encodes the syntactic relations of input lexical tokens in a sentence, and incorporates this information to derive syntactically motivated character embeddings for the TTS attention mechanism. Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
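To make the core idea concrete, the following is a minimal sketch of one way to derive syntax-aware token embeddings by attending only over edges of a dependency graph. It is an illustrative assumption, not the actual GraphSpeech architecture; the names `syntax_aware_embeddings`, `adjacency`, and `W` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def syntax_aware_embeddings(token_emb, adjacency, W, num_hops=2):
    """Hypothetical graph-attention sketch: each token aggregates
    information only from its syntactic neighbors (dependency edges),
    yielding syntactically motivated embeddings.

    token_emb: (n_tokens, dim) initial token embeddings
    adjacency: (n_tokens, n_tokens) binary dependency-graph adjacency
               (with self-loops so every row attends to itself)
    W:         (dim, dim) learnable bilinear scoring matrix
    """
    h = token_emb
    for _ in range(num_hops):
        scores = (h @ W) @ h.T                           # pairwise relevance
        scores = np.where(adjacency > 0, scores, -1e9)   # mask non-edges
        attn = softmax(scores, axis=-1)                  # attention over neighbors
        h = attn @ h                                     # aggregate neighbor features
    return h

# Toy dependency graph for "cats chase mice":
# the head "chase" is linked to both "cats" and "mice".
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 8))
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])     # self-loops + dependency edges
W = rng.normal(size=(8, 8))
print(syntax_aware_embeddings(emb, adj, W).shape)  # (3, 8)
```

The key design choice this sketch illustrates is the edge mask: unlike vanilla Transformer self-attention, which lets every token attend to every other token, attention here is restricted to syntactic neighbors, so the resulting embeddings reflect sentence structure.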