GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one such successful implementation. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at the sentence level. We propose a novel neural TTS model, denoted GraphSpeech, that is formulated under the graph neural network framework. GraphSpeech explicitly encodes the syntactic relations of input lexical tokens in a sentence, and incorporates this information to derive syntactically motivated character embeddings for the TTS attention mechanism. Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
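
As a rough illustration of what injecting syntactic relations into self-attention can look like, the following minimal sketch adds a relation embedding to each key before the attention scores are computed. It is not the paper's formulation: the relation labels, the toy dependency heads, the single attention head, and the names `relation_matrix` and `relation_aware_attention` are all assumptions made for this example, and it operates on three word-level tokens rather than characters.

```python
import numpy as np

# Hypothetical relation labels between token i and token j (illustration only).
SELF, TO_DEPENDENT, TO_HEAD, NO_EDGE = 0, 1, 2, 3

def relation_matrix(heads):
    """Build an (n, n) label matrix from dependency heads (-1 marks the root)."""
    n = len(heads)
    rel = np.full((n, n), NO_EDGE, dtype=int)
    for i, h in enumerate(heads):
        rel[i, i] = SELF
        if h >= 0:
            rel[i, h] = TO_HEAD       # token i looking at its head
            rel[h, i] = TO_DEPENDENT  # head looking at its dependent i
    return rel

def relation_aware_attention(x, rel_ids, rel_emb, w_q, w_k, w_v):
    """Single-head self-attention with score[i, j] = q_i . (k_j + r_ij) / sqrt(d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    r = rel_emb[rel_ids]                                   # (n, n, d)
    scores = (q @ k.T + np.einsum("id,ijd->ij", q, r)) / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)          # row-wise softmax
    return weights @ v, weights

# Toy sentence "the cat sleeps": the -> cat, cat -> sleeps, sleeps is the root.
heads = [1, 2, -1]
d = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(len(heads), d))        # random stand-in token embeddings
rel_emb = rng.normal(size=(4, d))           # one embedding per relation label
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

rel = relation_matrix(heads)
out, weights = relation_aware_attention(x, rel, rel_emb, w_q, w_k, w_v)
```

Here each attention score depends on the pair of tokens and on how they are connected in the dependency tree, which is one simple way for a self-attention layer to become sensitive to sentence-level syntax.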

Related content

Speech synthesis

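Speech synthesis, also called text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-answering call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, speech-based teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.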