With the advent of deep learning, a large number of text-to-speech (TTS) models that produce human-like speech have emerged. Recently, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models by introducing syntactic and semantic information about the input text. Although these strategies have shown impressive results, they still have limitations in how they utilize linguistic information. First, most approaches rely solely on graph networks to encode syntactic and semantic information, without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though adjacent words usually carry meaning relevant to the current word. To address these issues, we propose the Relation-aware Word Encoding Network (RWEN), which effectively encodes syntactic and semantic information through two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements over previous works.
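To make the two-module design concrete, below is a minimal, hypothetical sketch of how Semantic-level Relation Encoding and Adjacent Word Relation Encoding might be composed into a word encoder. The abstract does not specify the internal layers; the choice of global self-attention for the semantic module, a 1-D convolution for the adjacent-word module, the class names' internals, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an RWEN-style word encoder (assumptions labeled below).
import torch
import torch.nn as nn

class SemanticLevelRelationEncoding(nn.Module):
    """Relates every word to every other word. Multi-head self-attention is
    an assumed stand-in for the paper's semantic-relation encoder."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)  # residual connection

class AdjacentWordRelationEncoding(nn.Module):
    """Mixes each word embedding with its immediate neighbors via a
    1-D convolution; one plausible way to model adjacent-word context."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (batch, dim, seq_len), so transpose around it.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + h)

class RWEN(nn.Module):
    """Stacks the two modules to produce relation-aware word encodings."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.semantic = SemanticLevelRelationEncoding(dim)
        self.adjacent = AdjacentWordRelationEncoding(dim)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        return self.adjacent(self.semantic(word_embeddings))

# Usage: encode a batch of 2 sentences, 10 words each, 256-dim embeddings.
model = RWEN(dim=256)
out = model(torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```

In a TTS pipeline, encodings like these would typically condition the acoustic model alongside phoneme-level features; how RWEN is attached to the TTS backbone is not specified in the abstract.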