The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody. Recent parallel neural text-to-speech (TTS) synthesis methods can generate high-fidelity speech while maintaining high performance. However, these systems often lack simple control over the output prosody, which restricts the semantic information that can be conveyed for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control that learns a latent space directly corresponding to a change in emphasis. Three candidate features for the latent space are compared: 1) the variance of pitch and duration within words in a sentence, 2) a wavelet-based feature computed from pitch, energy, and duration, and 3) a learned combination of the above features. Objective measures reveal that the proposed methods achieve a wide range of emphasis modification, and subjective evaluations of the degree of emphasis and the overall quality indicate that they show promise for real-world applications.
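To make the first candidate feature concrete, the following is a minimal sketch of a word-level variance computation. The abstract does not give the exact definition, so the choice of per-phone inputs, the use of population variance, and all names here (`word_level_variance`, `durations`, `log_f0`, `word_ids`) are assumptions for illustration only, not the paper's implementation.

```python
from statistics import pvariance


def word_level_variance(phone_values, word_ids):
    """Population variance of a per-phone feature within each word.

    phone_values: one number per phone (e.g. log-F0 or duration in frames).
    word_ids: index of the word each phone belongs to (same length).
    Returns one variance per word, ordered by word index.
    """
    if len(phone_values) != len(word_ids):
        raise ValueError("phone_values and word_ids must have equal length")
    groups = {}
    for value, word in zip(phone_values, word_ids):
        groups.setdefault(word, []).append(value)
    return [pvariance(groups[w]) for w in sorted(groups)]


# Hypothetical three-word utterance (values are made up for illustration).
durations = [5, 7, 6, 12, 4, 4]            # frames per phone
log_f0 = [5.0, 5.2, 5.1, 5.6, 5.0, 5.0]    # mean log-F0 per phone
word_ids = [0, 0, 0, 1, 2, 2]              # phone-to-word alignment

duration_var = word_level_variance(durations, word_ids)
pitch_var = word_level_variance(log_f0, word_ids)
```

Under these assumptions, a word whose phones differ more in pitch or duration receives a larger value, which is one plausible way such a feature could track emphasis; a single-phone word yields a variance of zero.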