This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, an effect that becomes more evident as the dimensionality of the latent space increases in order to capture diverse prosodic representations. A trade-off therefore arises between the diversity of token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted in the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.
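To make the two-stage scheme concrete, the sketch below illustrates one possible PyTorch realization: a posterior encoder that extracts phoneme-level latents from token-aligned acoustic features, and a separately trained prior network that summarizes the input text into an utterance-level representation and regresses the posterior latents. Module names, dimensions, and the mean-pooled utterance embedding are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the two-stage approach (assumed shapes and module names).
import torch
import torch.nn as nn

class PosteriorEncoder(nn.Module):
    """Stage 1: extract token-level (phoneme-level) latent prosody vectors
    from token-aligned acoustic features."""
    def __init__(self, feat_dim=80, latent_dim=8):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(),
                                  nn.Linear(128, 2 * latent_dim))  # mean and log-variance

    def forward(self, token_feats):              # (B, T_tokens, feat_dim)
        mean, logvar = self.proj(token_feats).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterized sample
        return z, mean, logvar


class PriorNetwork(nn.Module):
    """Stage 2: given the input text, learn an utterance-level representation
    and predict the phoneme-level posterior latents from it."""
    def __init__(self, n_phonemes=100, latent_dim=8, hidden=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, hidden)
        self.utt_proj = nn.Linear(hidden, hidden)       # utterance-level summary
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, latent_dim)

    def forward(self, phoneme_ids):              # (B, T_tokens)
        h = self.phoneme_emb(phoneme_ids)        # token-level text encoding
        utt = torch.tanh(self.utt_proj(h.mean(dim=1, keepdim=True)))  # coarse utterance embedding
        utt = utt.expand(-1, h.size(1), -1)      # broadcast to every token
        out, _ = self.decoder(torch.cat([h, utt], dim=-1))
        return self.out(out)                     # predicted token-level latents


# Stage 1 (conceptually): train the synthesis model with the posterior encoder, then freeze it.
# Stage 2: regress the prior onto the detached posterior latents.
posterior, prior = PosteriorEncoder(), PriorNetwork()
phoneme_ids = torch.randint(0, 100, (2, 12))
token_feats = torch.randn(2, 12, 80)             # toy token-aligned acoustic features

with torch.no_grad():                            # posterior is fixed during stage 2
    z_post, _, _ = posterior(token_feats)
z_prior = prior(phoneme_ids)
prior_loss = torch.nn.functional.mse_loss(z_prior, z_post)
prior_loss.backward()                            # only the prior network is updated
```

Training the prior against detached posterior targets is what keeps the utterance-level representation from interfering with the token-level latent space learned in the first stage.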