This paper presents a method, based on prosodic clustering, for phoneme-level control of F0 and duration in a multispeaker text-to-speech setup. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel with a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically, we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers in the training set while maintaining speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and is shown to retain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range, despite the variability that a multispeaker setting introduces.
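The abstract names F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering without specifying the algorithms. The following is a minimal sketch of how these steps could fit together, under assumptions that do not come from the paper: per-speaker z-scoring for F0 normalization, a plain 1-D k-means over the pooled normalized values for the speaker-independent F0 clusters, and equal-count quantile bins for the balanced duration clusters. All function names are hypothetical.

```python
import numpy as np


def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score phoneme-level F0 within each speaker (assumed normalization)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        # Guard against a speaker whose F0 values are all identical.
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out


def kmeans_1d(values, n_clusters, n_iter=50):
    """Plain 1-D k-means (assumed clustering algorithm), deterministic init."""
    values = np.asarray(values, dtype=float)
    # Start the centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        dists = np.abs(values[:, None] - centroids[None, :])
        labels = np.argmin(dists, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids


def balanced_clusters(values, n_clusters):
    """Quantile bins so each cluster holds roughly the same number of samples."""
    values = np.asarray(values, dtype=float)
    # Interior quantile edges split the sorted values into equal-count bins.
    edges = np.quantile(values, np.linspace(0, 1, n_clusters + 1)[1:-1])
    labels = np.searchsorted(edges, values, side="right")
    return labels, edges


# Toy data: two speakers with different F0 ranges (values in Hz, made up).
f0 = [100.0, 120.0, 140.0, 200.0, 240.0, 280.0]
speakers = ["a", "a", "a", "b", "b", "b"]
f0_norm = normalize_f0_per_speaker(f0, speakers)

# Clustering the pooled, normalized values yields labels shared by all speakers.
f0_labels, f0_centroids = kmeans_1d(f0_norm, n_clusters=3)

# Phoneme durations in frames (made up), split into equal-count clusters.
dur_labels, dur_edges = balanced_clusters(np.arange(1, 13), n_clusters=3)
```

In this sketch, normalizing before clustering is what lets a single label set cover speakers with different pitch ranges: the low, mid, and high phonemes of both toy speakers fall into the same three clusters. The resulting discrete labels are the kind of phoneme-level input a prosody encoder could consume, although the actual model interface is not described in the abstract.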