Intonation plays an important role in conveying a speaker's intention. However, current end-to-end TTS systems often fail to model intonation properly. To alleviate this problem, we propose a novel, intuitive method for synthesizing speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template for the given text. The intonation encoder, attached to the text encoder output, synthesizes speech that follows the requested intonation template. The main contributions of our paper are: (a) an easy-to-use intonation control system suitable for a wide range of users; (b) better performance in rendering speech in a requested intonation, as shown by improved objective and subjective evaluation results; and (c) the incorporation of a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
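The unsupervised grouping step can be illustrated with a small sketch. The abstract does not state which features or clustering algorithm are used, so everything here is an assumption made for illustration: each utterance is represented by its voiced F0 contour, the contour is resampled to a fixed length and normalized, and plain k-means with a deterministic farthest-first initialization produces the templates. The function names `resample_contour` and `kmeans_templates` are hypothetical and not taken from the paper.

```python
import numpy as np


def resample_contour(f0, num_points=16):
    """Resample a voiced F0 contour (Hz) to a fixed length, then z-normalize it.

    Assumption: log-F0 shape, independent of speaker pitch range, is the feature.
    """
    f0 = np.asarray(f0, dtype=float)
    src = np.linspace(0.0, 1.0, len(f0))
    dst = np.linspace(0.0, 1.0, num_points)
    contour = np.interp(dst, src, np.log(f0))
    std = contour.std()
    return (contour - contour.mean()) / (std if std > 0 else 1.0)


def _assign(x, centers):
    """Index of the nearest center (squared Euclidean distance) for each row of x."""
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_templates(contours, num_templates, num_iters=50):
    """Group fixed-length contours into templates with plain k-means.

    Returns (templates, labels). Initialization is farthest-first, so the
    result is deterministic for a given input order.
    """
    x = np.stack(contours)
    centers = [x[0]]
    for _ in range(1, num_templates):
        # Distance of every point to its closest already-chosen center.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)

    for _ in range(num_iters):
        labels = _assign(x, centers)
        for k in range(num_templates):
            members = x[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers, _assign(x, centers)
```

Under these assumptions, each row of the returned `templates` array is one intonation template (a mean normalized contour), and `labels` gives the template index of each utterance, which could then serve as a training target for a predictor or as a conditioning input for an encoder.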